Foreword



The goal of establishing an international conference on Intelligent Data Engi- 
neering and Automated Learning (IDEAL) is to provide a forum for researchers 
and engineers from academia and industry to meet and to exchange ideas on the 
latest developments in an emerging field that brings theories and techniques
from database, data mining and knowledge discovery, statistical and computa- 
tional learning together for intelligent data processing. The efforts towards this 
goal have been greatly supported by colleagues both in the Asia Pacific region
and all over the world, and further encouraged by the success of IDEAL98. A 
significant development was achieved in IDEAL 2000 this year, which is evidenced
not only by an expansion of the major tracks from two to three, namely,
Financial Engineering, Data Mining, and Intelligent Agents, but also by a
considerable increase in the number of submissions and a high quality technical
program. This achievement comes from the efforts of the program and orga- 
nizing committee, with a large number of supporters all over the world. It was 
their hard work, often over sleepless nights, that brought about the successful 
conference. We would like to take this opportunity to thank them all. Their 
names and affiliations are shown in the symposium program.

We especially want to express our appreciation to the staff of The Chinese 
University of Hong Kong for their boundless contributions to this conference, 
particularly to Prof. Lai-wan Chan and Prof. Kwong Sak Leung as the Program 
Co-chairs and Prof. Jimmy Lee as the Organizing Chair. We thank the members 
of the International Program Committee, without whom we could not guarantee
the high quality of the papers. Members of the Organizing Committee were
instrumental behind the scenes. Prof. Irwin King and Prof. Evangeline Young
did a superb job with the local arrangements. Prof. Wai Lam took care of the
registration process, and, last but not least, Prof. Helen Meng ensured the smooth
publication of the conference proceedings.

Moreover, we would like to thank Prof. Michael Dempster, Prof. Nick Jen- 
nings, Prof. Wei Li, and Heikki Mannila for their support as keynote speakers, 
bringing us the latest developments and future trends in the emerging fields,
and also Prof. Zhenya He and Prof. Weixin Xie for organizing a special panel
session, providing an insight into recent advances in the field in China.

Lastly, we hope you enjoyed your stay in Hong Kong and at The Chinese 
University. 



October 2000 



Pak-Chung Ching and Lei Xu 
General Co-chairs 

The Chinese University of Hong Kong 




Preface 



Data Mining, Financial Engineering, and Intelligent Agents are emerging fields in 
modern Intelligent Data Engineering. In IDEAL 2000, these fields were selected 
as the major tracks. IDEAL 2000 was the Second International Conference on 
Intelligent Data Engineering and Automated Learning, a series of biennial
conferences. This year, we received over one hundred regular submissions and each
paper was vigorously reviewed by experts in the field. We truly appreciate the
work done by the reviewers. Some reviewers wrote lengthy and constructive
comments to the authors for improving their papers. The overall program covered
various topics in data mining, financial engineering, and agents. We also had
a number of papers applying the above techniques to internet and multimedia
processing. 

We would like to thank our keynote speakers and the organizers of the special
sessions and panel session. For keynote talks,

• Professor M.A.H. Dempster, University of Cambridge, UK, gave a keynote
talk on “Wavelet-Based Valuation of Derivative”,

• Professor Nick Jennings, University of Southampton, UK, gave a keynote
talk on “Automated Haggling: Building Artificial Negotiators”,

• Professor Wei Li, Beijing University of Aeronautics and Astronautics,
China, gave a keynote talk on “A Computational Framework for Convergent
Agents”, and

• Professor Heikki Mannila, Helsinki University of Technology, Finland, gave
a keynote talk on “Data Mining: Past and Future”.

Apart from the regular submissions, we also had two special sessions and a panel
session at the conference.

• Professor Shu-Heng Chen of the National Chengchi University and Professor
K.Y. Szeto of Hong Kong University of Science and Technology
organized the special session on “Genetic Algorithms and Genetic Programming
in Agent-Based Computational Finance”.

• Dr. Yiu-Ming Cheung of The Chinese University of Hong Kong organized
the special session on “Data Analysis and Financial Modeling”.

• Professor Zhenya He of the Southeast University and Professor Weixin Xie
of the Shenzhen University organized a panel session on “Intelligent Data
Engineering Automated Learning: Recent Advances in China”.

We would like to express our gratitude to our general chairs, Professors
Pak-Chung Ching and Lei Xu, for their leadership and support. We appreciate and
thank the Organizing and Program Committee members for their devotion to
the organization of the conference and the reviewing of the papers; in particular,
Professor Jimmy Lee, the Organizing Chair of IDEAL 2000, for his great effort in
the organization of the conference throughout, and Professors Irwin King, Helen
Meng, Wai Lam, and Evan F. Y. Young for their time, effort, and constructive
suggestions. We would also like to thank the supporting staff of the Department
of Computer Science and Engineering of the Chinese University of Hong Kong
for various help. Last but not least, we thank Chung Chi College for the
sponsorship of the conference.

October 2000 Kwong-Sak Leung and Lai-Wan Chan

Program Co-chairs 

The Chinese University of Hong Kong 
Clustering by Similarity in an Auxiliary Space



Janne Sinkkonen and Samuel Kaski 

Neural Networks Research Centre 
Helsinki University of Technology 
P.O.Box 5400, FIN-02015 HUT, Finland 
janne.sinkkonen@hut.fi, samuel.kaski@hut.fi

Abstract. We present a clustering method for continuous data. It de- 
fines local clusters into the (primary) data space but derives its similarity 
measure from the posterior distributions of additional discrete data that 
occur as pairs with the primary data. As a case study, enterprises are 
clustered by deriving the similarity measure from bankruptcy sensitivity. 
In another case study, a content-based clustering for text documents is 
found by measuring differences between their metadata (keyword dis- 
tributions). We show that minimizing our Kullback-Leibler divergence- 
based distortion measure within the categories is equivalent to maximiz- 
ing the mutual information between the categories and the distributions 
in the auxiliary space. A simple on-line algorithm for minimizing the 
distortion is introduced for Gaussian basis functions and their analogs 
on a hypersphere. 



1 Introduction 

Clustering by definition produces localized groups of items, which implies that 
the results depend on the similarity measure used. We study the special case
in which additional, stochastic information about a suitable similarity measure
for the items $x_k \in \mathbb{R}^n$ exists in the form of discrete auxiliary data $c_k$. Thus,
the data consists of primary-auxiliary pairs $(x_k, c_k)$. In the resulting clusters the
data items $x$ are similar by the associated conditional distributions $p(c|x)$. Still,
because of their parameterization, the clusters are localized in the primary space 
in order to retain its (potentially useful) structure. The auxiliary information is 
only used to learn what distinctions are important in the primary data space. 

We have earlier explicitly constructed an estimate $\hat{p}(c|x)$ of the conditional
distributions, and a local Riemannian metric based on that estimate [5]. Metrics 
have additionally been derived from generative models that do not use auxiliary 
information [3,4]. Both kinds of metrics could be used in standard clustering 
methods. In this paper we present a simpler method that directly minimizes the 
within-cluster dissimilarity, measured as distortion in the auxiliary space. 

We additionally show that minimizing the within-cluster distortion maxi- 
mizes the mutual information between the clusters and the auxiliary data. Max- 
imization of mutual information has been used previously for constructing rep- 
resentations of the input data [1]. 

In another related work, the information bottleneck [7,9], data is also clus- 
tered by maximizing mutual information with a relevance variable. Contrary to 

K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 3-8, 2000. 
our work, the bottleneck treats discrete or prepartitioned data only, whereas we
create the categories by optimizing a parametrized partitioning of a continuous 
input space. 

2 The Clustering Method 

We cluster samples $x \in \mathbb{R}^n$ of a random variable $X$. The parameterization of
the clusters keeps them local, and the similarity of the samples is measured as
the similarity of the conditional distributions $p(c|x)$ of the random variable $C$.

Vector quantization (VQ) is one approach to categorization. In VQ the data
space is divided into cells represented by prototypes or codebook vectors $m_j$,
and the average distortion between the data and the prototypes,

$$E = \sum_j \int y_j(x)\, D(x, m_j)\, p(x)\, dx , \qquad (1)$$

is minimized. Here $D(x, m_j)$ denotes a dissimilarity between $x$ and $m_j$, and
$y_j(x)$ is the cluster membership function for which $0 \le y_j(x) \le 1$ and
$\sum_j y_j(x) = 1$. In the classic “hard” VQ the membership function is binary valued: $y_j(x) = 1$
if $D(x, m_j) \le D(x, m_i)$ for all $i$, and $y_j(x) = 0$ otherwise. In the “soft” VQ,
the $y_j(x)$ attain continuous values and they can be interpreted as conditional
densities $p(v_j|x) = y_j(x)$ of a discrete random variable $V$ that indicates the
cluster identity. Given $x$, $C$ and $V$ are conditionally independent:
$p(c, v|x) = p(c|x)\, p(v|x)$. It follows that $p(c, v) = \int p(c|x)\, p(v|x)\, p(x)\, dx$.

Our measure of dissimilarity is the Kullback-Leibler divergence, defined for
two multinomial distributions with event probabilities $\{p_i\}$ and $\{q_i\}$ as
$D_{KL}(p_i, q_i) = \sum_i p_i \log(p_i/q_i)$. In our case, the first distribution corresponds
to the data $x$: $p_i = p(c_i|x)$. The second distribution will be the prototype. It
can be shown that the optimal prototype, given that the values of the $y_j(x)$
are fixed, is $q_{ji} = p(c_i|v_j) = p(c_i, v_j)/p(v_j)$. By plugging this prototype and the
Kullback-Leibler distortion measure into the error function of VQ, equation (1),
we get

$$E_{KL} = \sum_j \int y_j(x)\, D_{KL}\big(p(c|x), p(c|v_j)\big)\, p(x)\, dx . \qquad (2)$$

Instead of computing the distortion between the vectorial samples and vectorial 
prototypes as in (1), we now have pointwise comparisons between the distri- 
butions p{c\x) and the indirectly defined prototypes p{c\vj). The primary data 
space has been used to define the domain in the auxiliary space that is used for 
estimating each prototype. 

If the membership functions are parametrized by $\theta$, the average distortion
becomes

$$E_{KL} = -\sum_{i,j} \log p(c_i|v_j) \int y_j(x; \theta)\, p(c_i, x)\, dx + \text{const.}, \qquad (3)$$
where the constant is independent of the parameters. Note that minimizing
the average distortion $E_{KL}$ is equivalent to maximizing the mutual information
between $C$ and $V$, because $E_{KL} = -I(C; V) + \text{const}$.
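The relation between the distortion and the mutual information can be checked numerically. The sketch below is only an illustration under stated assumptions: the integral over $x$ is replaced by a sum over four discrete points, and all probabilities and memberships are made-up toy values. For such a discrete case the constant equals $I(C; X)$, so the distortion with optimal prototypes should equal $I(C; X) - I(C; V)$.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats for a joint probability table joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for a, row in enumerate(joint)
               for b, p in enumerate(row) if p > 0)

def kl(p, q):
    """Kullback-Leibler divergence between two multinomials."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy problem (illustrative assumptions, not data from the paper).
p_x = [0.2, 0.3, 0.1, 0.4]                     # p(x) on four points
p_c_given_x = [[0.9, 0.1], [0.8, 0.2],         # p(c|x), two auxiliary classes
               [0.3, 0.7], [0.1, 0.9]]
y = [[0.7, 0.3], [0.6, 0.4],                   # soft memberships y_j(x)
     [0.2, 0.8], [0.1, 0.9]]
n_x, n_c, n_v = len(p_x), 2, 2

# p(c_i, v_j) = sum_x p(c_i|x) y_j(x) p(x)
p_cv = [[sum(p_c_given_x[x][i] * y[x][j] * p_x[x] for x in range(n_x))
         for j in range(n_v)] for i in range(n_c)]
p_v = [sum(p_cv[i][j] for i in range(n_c)) for j in range(n_v)]
# Optimal prototypes p(c_i|v_j) = p(c_i, v_j) / p(v_j)
proto = [[p_cv[i][j] / p_v[j] for i in range(n_c)] for j in range(n_v)]

# Discrete analogue of the Kullback-Leibler distortion
e_kl = sum(p_x[x] * y[x][j] * kl(p_c_given_x[x], proto[j])
           for x in range(n_x) for j in range(n_v))

p_xc = [[p_x[x] * p_c_given_x[x][i] for i in range(n_c)] for x in range(n_x)]
i_cx = mutual_information(p_xc)   # the "constant": it does not depend on y
i_cv = mutual_information(p_cv)
print(e_kl, i_cx - i_cv)          # the two values should agree
```

Because $I(C; X)$ does not depend on the memberships, lowering the distortion in this toy setting can only happen by raising $I(C; V)$.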

The choice of parameterization of the membership functions depends on the 
data space. For Euclidean spaces Gaussians have desirable properties. When the 
data comes from an n-dimensional hypersphere, spherical analogs of Gaussians, 
the von Mises-Fisher (vMF) basis functions [6], are more appropriate. Below we
derive the algorithm for vMF’s; the derivation for Gaussians is analogous. 



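Von Mises-Fisher Basis Functions. A normalized n-dimensional vMF basis
function is defined for normalized data by

$$y_j(x) = \frac{M(x; w_j)}{\sum_k M(x; w_k)}, \quad \text{where} \quad M(x; w_j) = \frac{\kappa^{\frac{1}{2}n-1}}{(2\pi)^{\frac{1}{2}n}\, I_{\frac{1}{2}n-1}(\kappa)} \exp\left(\kappa\, \frac{x^T w_j}{\|w_j\|}\right), \qquad (4)$$

and $I_r(\kappa)$ denotes the modified Bessel function of the first kind and order $r$.
The dispersion parameter $\kappa$ is selected a priori. With the vMF basis functions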
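the gradient of the average distortion (3) becomes

$$\nabla_{w_j} E_{KL} = -\kappa \sum_i \sum_{l \neq j} \log\frac{p(c_i|v_j)}{p(c_i|v_l)} \int \big(x - w_j w_j^T x\big)\, y_j(x)\, y_l(x)\, p(c_i, x)\, dx , \qquad (5)$$

where the $w_j$ are assumed normalized (without loss of generality).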



An on-line algorithm can be derived using $y_j(x)\, y_l(x)\, p(c_i, x) = p(v_j, v_l, c_i, x)$
as the sampling function for stochastic approximation. The following steps are
repeated with $\alpha(t)$ gradually decreasing to zero:

1. At the step $t$ of stochastic approximation, draw a data sample $(x(t), c_i(t))$.

2. Draw independently two basis functions, $j$ and $l$, according to the probabilities
$\{y_k(x(t))\}$.

3. Adapt the parameters $w_j$ according to $w_j(t+1) = \Delta w_j / \|\Delta w_j\|$, where

$$\Delta w_j = w_j(t) + \alpha(t) \log\frac{\hat{p}(c_i|v_j)}{\hat{p}(c_i|v_l)} \big(x(t) - w_j(t)\, w_j(t)^T x(t)\big) , \qquad (6)$$

and $\alpha(t)$ is the gradually decreasing step size. The $\hat{p}$ are estimates of the
conditional probabilities. The parameters $w_l$ can be adapted at the same
step, by exchanging $j$ and $l$ in (6).

4. Adapt the estimates $\hat{p}(c_i|v_j)$ with stochastic approximation, using the expression

$$\hat{p}(c_i|v_j)(t+1) = (1 - \lambda(t))\, \hat{p}(c_i|v_j)(t) + \lambda(t)$$
$$\hat{p}(c_k|v_j)(t+1) = (1 - \lambda(t))\, \hat{p}(c_k|v_j)(t) , \quad k \neq i$$

where the rate of change $\lambda(t)$ should be larger than $\alpha(t)$. In practice, $2\alpha(t)$
seems to work.
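The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names (`normalize`, `memberships`, `online_step`), the two-class toy data on the unit circle, the step-size schedule and the value of $\kappa$ are all assumptions made for the example. The sketch relies on the fact that the Bessel-function normalizer of the vMF kernel cancels in the membership ratio when all basis functions share the same $\kappa$.

```python
import math
import random

def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def memberships(x, W, kappa):
    """y_j(x) for unit-length x and w_j: a softmax of kappa * x.w_j."""
    scores = [kappa * sum(a * b for a, b in zip(x, w)) for w in W]
    top = max(scores)                      # subtract the maximum for stability
    expd = [math.exp(s - top) for s in scores]
    total = sum(expd)
    return [e / total for e in expd]

def online_step(x, i, W, P, alpha, kappa, rng):
    """Steps 2-4 for one sample (x, c_i); W[j] is w_j, P[j][i] estimates p(c_i|v_j)."""
    y = memberships(x, W, kappa)
    j, l = rng.choices(range(len(W)), weights=y, k=2)        # step 2
    log_ratio = math.log(P[j][i] / P[l][i])
    updated = {}
    for a, sign in ((j, 1.0), (l, -1.0)):                    # step 3 for w_j, then w_l
        w = W[a]
        wx = sum(p * q for p, q in zip(w, x))
        delta = [wk + alpha * sign * log_ratio * (xk - wk * wx)
                 for wk, xk in zip(w, x)]
        updated[a] = normalize(delta)
    for a, w in updated.items():
        W[a] = w
    lam = 2.0 * alpha                                        # step 4, rate 2 * alpha
    P[j] = [(1.0 - lam) * p + (lam if c == i else 0.0)
            for c, p in enumerate(P[j])]

# Toy run (assumed data): class 0 lies near angle 0, class 1 near angle pi/2.
rng = random.Random(0)
W = [normalize([rng.gauss(0, 1), rng.gauss(0, 1)]) for _ in range(2)]
P = [[0.5, 0.5], [0.5, 0.5]]
for t in range(2000):
    i = rng.randrange(2)                                     # step 1: draw a sample
    angle = (0.0 if i == 0 else math.pi / 2) + rng.gauss(0, 0.2)
    x = [math.cos(angle), math.sin(angle)]
    online_step(x, i, W, P, alpha=0.1 / (1 + 0.01 * t), kappa=5.0, rng=rng)
print(W, P)
```

Here only the estimate for the drawn cluster $j$ is adapted in step 4, as the text states; adapting the estimate for $l$ as well would be a natural variant. The step size must stay below 0.5 so that $\lambda(t) = 2\alpha(t) < 1$ keeps the estimates positive.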
3 Case Studies

We applied our model and two other models to two different data sets. The other
models were the familiar mixture model $p(x) = \sum_j p(x|j)\, P(j)$, and the mixture
discriminant model $p(c_i, x) = \sum_j P(c_i|j)\, p(x|j)\, P(j)$ (MDA2 [2]). The $P(j)$ are
mixing parameters, and the $P(c_i|j)$ are additional parameters that model class
distributions.

Clustering of text documents is useful as such, and the groupings can additionally
be used to speed up searches. We demonstrate that grouping based on
textual content, with goodness measured by independent topic information, can
be improved by our method utilizing (manually constructed) metadata (key- 
words). Thus, in this application our variable C corresponds to the keywords, 
and the variable X represents the textual content of the documents, encoded 
into a vector form. 

Model performance was measured by the mutual information between the 
generated (soft) categories and nine topic classes, such as nuclear physics and
optics, found independently by informaticians. 

We carried out two sets of experiments with different preprocessing. The 
von Mises-Fisher kernels (4) were used both in our model and as the mixture 
components $p(x|j) = M(x; w_j)$. To encode the textual content, the words in the
abstracts and titles were used, converted to base form. The rarest words were
discarded. Documents with fewer than 5 words remaining after the preprocessing
were discarded, resulting in about 50,000 data vectors. 

The first experiment utilized no prior relevance information of the words: we 
picked 500 random words and encoded the documents with the “vector space 
model” [8] with “TF” (term frequency) weighting. In the second experiment 
more prior information was utilized. Words belonging to a stop-list were re- 
moved, and the “TF-IDF” (term frequency times inverse document frequency) 
weighting was used. In the first experiment with ’random’ feature selection, our 
method performed clearly better than the other models. With the improved 
feature extraction the margin reduced somewhat (Fig. 1). 

Clustering enterprises by bankruptcy sensitivity. We clustered financial statements
of small and medium-sized Finnish enterprises by bankruptcy sensitivity,
a key issue affecting credit decisions. The data set consisted of 6195 financial
statements, of which 158 concerned companies that later went bankrupt. Multiple
yearly statements from the same enterprise were treated as independent samples.

We compared the MDA2 with our model. The basis functions $M(x; w_j)$ of
both models were Gaussians parametrized by their location, with the covariance
matrices a priori set to $\sigma^2 I$. Measured by the mutual information, our model
clearly outperformed MDA2 (Fig. 2). Note that it is not feasible to estimate our
model with the straightforward algorithm presented in this paper when $\sigma$ is
very small. The reason is that the gradient (5) becomes very small because of
the products $y_j(x)\, y_l(x)$.
Fig. 1. Mutual information between the nine document clusters and the topics (not
used in learning). a Random feature extraction, dimensionality: 500. b Informed feature
extraction, dimensionality: 4748. Solid line: our model, dashed line: MDA2, dotted line:
mixture model. (Due to slow convergence of MDA2, it was infeasible to compute it for
part b. Another comparison with MDA2 is shown in Fig. 2.)



4 Conclusions 



We have demonstrated that clusters obtained by our method are more informa- 
tive than clusters formed by a generative mixture model, MDA2 [2], for two kinds 
of data: textual documents and continuous-valued data derived from financial 
statements of enterprises. In (unpublished) tests for two additional data sets the 
results have been favorable to our model, although for one set the margin to 
MDA2 was narrow compared to the cases presented here. 

For the first demonstration with textual documents, it would be interesting 
to compare the present method with the information bottleneck [7, 9] and met- 
rics derived from generative models [3]. For the continuous data of the second 
experiment the bottleneck is not (directly) applicable. A generative model could 
be constructed, and we will compare our approach with such “unsupervised” 
generative models in subsequent papers. 

When the feature extraction was improved using prior knowledge, the mar- 
gin between our method and the “unsupervised” mixture model reduced. This 
suggests that our algorithm may be particularly useful when good feature extrac- 
tion stages are not available but there exists auxiliary information that induces 
a suitable similarity measure. 
Fig. 2. Mutual information between the posterior probabilities of the ten enterprise
clusters and the binary bankruptcy indicator. Solid line: our model, dashed line: MDA2.
A set of 25 financial indicators was used as the primary data. The binary variable C
indicated whether the statement was followed by a bankruptcy within 3 years.
Abstract. In the generalised lotto-type competitive learning algorithm, more than
one winner exists. The winners are divided into a number of tiers (or divisions),
with each tier being rewarded differently. All the losers are penalised (either
equally or differently). In order to study the various properties of
generalised lotto-type competitive learning, a set of equations which governs
its operations is formulated. This is then used to analyse the stability and other
dynamic properties of generalised lotto-type competitive learning.



1 Introduction 

Recently, there has been strong interest in deploying various techniques, such as neural
networks, genetic and evolutionary algorithms, modern statistical methods and fuzzy
logic, in financial analysis, modelling, and prediction [1]. Some of these techniques
utilise competitive learning to locate features that are essential in financial market
modelling and prediction. The focus of this paper is to analyse a new class of
competitive learning paradigm. We begin by introducing the classical view of
competitive learning. Each input prototype $x_i$ ($i = 1, 2, \ldots, M$, where $M$ is the number of
prototypes available for training) will activate one member of the output layer (i.e. the
winning node or neuron, say node $c$, such that $c = \arg\min_j \|x_i - \omega_j\|$) and its
corresponding (long-term) weight vector $\omega_c$ (assumed to be of the same
dimensionality, say $d$, as $x_i$) is updated by $\eta_c (x_i - \omega_c)$, where $\eta_c$ is the learning
rate of the winning node and $0 < \eta_c < 1$. The rest of the nodes (i.e. $j \neq c$, $j = 1, 2, \ldots, k$,
where $k$ is the number of output nodes) in the output layer are not activated and
their corresponding weight vectors ($\omega_j$) are not modified. This form of learning is also
referred to as winner-take-all (WTA) learning [2-4]. For correct convergence, $\eta_c$
should reduce with time $t$. The description given above is the sorting method of
implementing the WTA learning.
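The sorting method of WTA learning described above can be sketched as follows. This is a minimal illustration with made-up numbers; the function name `wta_step` and the use of squared Euclidean distance to pick the winner are assumptions of the example.

```python
def wta_step(x, weights, eta):
    """One winner-take-all update: move only the closest weight vector towards x.

    Returns the index c of the winning node."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, w)) for w in weights]
    c = dists.index(min(dists))                      # winner: nearest weight vector
    weights[c] = [b + eta * (a - b) for a, b in zip(x, weights[c])]
    return c

# Two output nodes in two dimensions (illustrative values).
weights = [[0.0, 0.0], [10.0, 10.0]]
winner = wta_step([1.0, 1.0], weights, eta=0.5)
print(winner, weights)
```

In a full training loop the learning rate `eta` would be reduced over time, as the text requires for correct convergence.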

Alternatively, it can be implemented via a synchronous or an asynchronous
network. In both cases, it usually comprises (a) a competitive network which
includes a $k$ by $k$ weight matrix and $k$ competitive nonlinear output nodes, and (b) a
matching network which contains a $d$ by $k$ weight matrix. The matching network is for
long-term memory (i.e. the weight vectors $\omega_j$ in the sorting method) and the
competitive network is for short-term memory. In WTA type learning, after an input
sample is presented, only one node (the winner) in the competitive network will
remain active (or switched on) after a number of iterations (or a single iteration,
depending on which type of network is used) and its corresponding long-term memory
will be updated. In the past, most of the research efforts have been concentrated on the

K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 9-16, 2000. 
behaviour of the short-term memory [2-4]. There are many different ways of
implementing the short-term memory. Usually, each of the k nodes is
excited by its own weighted output and inhibited by a weighted sum of either its
neighbours or other active nodes. The weights for the inhibition signals are usually of
the same value (though they may be adaptively modified) and usually smaller than the
weight of the single excitation signal. These were reviewed in [5, 6].

Currently, there are some efforts in studying the behaviour of the long-term
memory. It has been known for a long time [2-4] that some long-term memory of the
output nodes simply cannot learn. These nodes are usually referred to as dead units. It
is found that a sensitivity or conscience parameter $\gamma$ [7] (which can be implemented as
a winner frequency counter) can be included to prevent such dead units (or nodes)
from occurring. The sensitivity parameter counts the number of times that a certain
node wins. If that node wins more frequently than other nodes, its chances of
winning in future iterations are reduced, thereby giving other, less frequently winning
nodes opportunities to learn. Unfortunately, dead units may still arise if (a) the
number of output nodes is more than the number of clusters or (b) the nodes are
unfairly initialised. In these cases, the dead units will usually position themselves near
the boundaries of the desired clusters or somewhere around their initial locations.

Xu et al. [8] proposed to solve this problem by penalising (the long-term weight
vector of) the nearest rival node. The update rule for the rival node is similar to that of the
winner, except that the de-learning rate $\eta_r$ is a small negative real number and $|\eta_r| \ll \eta_c$.
For correct convergence, it has been shown that it is necessary that (a) $8 < |\eta_c/\eta_r| < 15$ and
(b) the number of output nodes $k$ cannot exceed twice the number of clusters [9]. This
then forms a sort of “push-pull” action on the output nodes and drives the right
number of output nodes towards the relevant cluster centres; at the same time, the
excess or extra nodes are being pushed away from the area of interest. Variants that
are based on either finite mixture or multisets modelling have been proposed [10].
Equally, we proposed that the rival can be rewarded, or it can be randomly rewarded
or penalised, and the same “push-pull” action has been observed [11]. (Another
variant scheme is proposed in [12] where the weights of the winners for the current
and previous iterations are updated if and only if their status has changed. In this
case, the current winner is rewarded and the previous winner is penalised.)

More recently, we proposed a new class of competitive learning that is based on
models of the lottery game. The basic idea of lotto-type competitive learning is that
the (long-term weight vector of the) winner is rewarded as in the classical competitive
learning and ALL the losers (long-term weight vectors) are penalised (either
equally or unequally) [13]. This idea was later extended to include more than
one winner - the generalised lotto-type competitive learning [14]. It is shown
experimentally that such learning strategies can produce results very similar to those of
the rival-penalised competitive learning for both closely and sparsely spaced cluster
structures - that is, extra nodes are pushed away from the clusters of interest.
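A sorting-method sketch of one lotto-type update is given below. It is an illustration under simplifying assumptions rather than the scheme of [13] or [14]: all winners share one reward rate and all losers share one penalty rate (no tiers), and the names `ltcl_step`, `eta_win` and `eta_lose` are hypothetical.

```python
def ltcl_step(x, weights, n_winners, eta_win, eta_lose):
    """Reward the n_winners nearest weight vectors and penalise all the others.

    Rewarded nodes move towards x, penalised nodes move away from x.
    Returns the indices of the winners, nearest first."""
    order = sorted(range(len(weights)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(x, weights[j])))
    winners = order[:n_winners]
    for j, w in enumerate(weights):
        rate = eta_win if j in winners else -eta_lose
        weights[j] = [b + rate * (a - b) for a, b in zip(x, w)]
    return winners

# Three one-dimensional nodes, one winner (illustrative values).
weights = [[0.0], [3.0], [9.0]]
winners = ltcl_step([1.0], weights, n_winners=1, eta_win=0.5, eta_lose=0.25)
print(winners, weights)
```

With `n_winners` greater than one the same function gives a simple version of the generalised scheme; tiered rewards would replace the single `eta_win` by one rate per tier.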

It is also shown in [14] that there are some similarities between the generalised
lotto-type competitive learning (LTCL) and the competitively inhibited neural network
(CINN) [15]. Such similarity is exploited to formulate a set of dynamic LTCL
equations. This set of equations will be outlined in section 2 and major results on the
dynamic behaviour of these equations will be given in section 3. Concluding remarks
will be given in section 4.

2 The Lotto-type Competitive Learning Equations 

Both the original and generalised lotto-type competitive learning were first
implemented using the sorting method [13-14]. In these cases, only the long-term
memory updating rules are of relevance. However, in order to study the other basic
properties of the algorithms, it is useful to incorporate some form of short-term
memory into the learning paradigm. We have chosen the form proposed in [15]
because it is one of the few models that incorporates both the short-term and long-term
memory in its analysis. The model is simple, elegant and well studied [15]. However,
we would like to modify it so that it reflects the lottery gaming idea not only in the
equation that governs the long-term memory but also in those equations that control
the short-term memory behaviour.

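Formally, the set of lotto-type competitive learning (LTCL) equations is defined as
[14]:

$$\dot{y}_j = -y_j + E(x_i, \omega_j) + I_j \qquad (1)$$

$$\dot{\omega}_j = f(y_j)\, \eta_j\, (x_i - \omega_j) \qquad (2)$$

$$f(y_j) = \begin{cases} 1 & \text{if } y_j > 0 \\ -1 & \text{otherwise} \end{cases} \qquad (3)$$

$$I_j = \alpha f(y_j) + \beta \sum_{m \neq j} f(y_m) \qquad (4)$$
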

Readers may note that (1) and (2) are the short-term memory (STM) and long-term
memory (LTM) state equations, respectively. Here, $E$ is defined as a bounded
positive decreasing external stimulus function, $I$ is defined as an internal stimulus
function, $y$ as the node STM state, with $\alpha$ and $\beta$ as some positive real constants.
(Note that $\alpha$ and $\beta$ are the two possible values in the STM weight matrix.) The dot
above a variable denotes the time derivative of that variable. The function $f$ in (3) is
a bipolar switching function. This is a major departure from the set of dynamic
equations of the competitively inhibited neural network (CINN). In effect, this
equation states that if the node is one of the winners, its neuronal activity (or
excitation level) $y$ must be greater than zero. However, if the node is one of
the losers, its neuronal activity must be less than or equal to zero. It follows that the
inhibitions in (4) are given by the losers, and the excitations are contributed by the
winners. The weight vector (of the LTM state) will be updated by (2), which is the
difference between the input and the current value (or state) of the weight vector
multiplied by its corresponding learning rate. The reward or penalisation of that
weight vector (i.e. the sign of the updating rule) is automatically given by the function
$f$. (This is the main difference between classical and lotto-type competitive learning.
Here, we also break away from most of the models [2-3] of the neuronal activities,
where they are assumed to be self excitation with lateral inhibition. The proposed
equations represent a more complicated model of neuronal activities. In our case, the
neuronal activity is governed by the neuronal states of a group of k neurons which
provide both excitation and inhibition signals.)
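The roles of the bipolar function and the internal stimulus can be made concrete with a small numerical sketch. It assumes the forms $f(y) = 1$ for $y > 0$ and $-1$ otherwise, $I_j = \alpha f(y_j) + \beta \sum_{m \neq j} f(y_m)$ and $\dot{y}_j = -y_j + E_j + I_j$; the external stimuli are passed in as plain numbers, and all values and function names are illustrative.

```python
def bipolar(y):
    """Switching function f: +1 for a winner (y > 0), -1 for a loser."""
    return 1.0 if y > 0 else -1.0

def internal_stimulus(y, alpha, beta):
    """I_j = alpha * f(y_j) + beta * (sum of f(y_m) over all m != j)."""
    f = [bipolar(v) for v in y]
    total = sum(f)
    return [alpha * fj + beta * (total - fj) for fj in f]

def stm_rate(y, external, alpha, beta):
    """Right-hand side of the STM state equation for every node."""
    stim = internal_stimulus(y, alpha, beta)
    return [-yj + ej + ij for yj, ej, ij in zip(y, external, stim)]

# One winner (node 0) and two losers, with assumed external stimuli.
y = [0.5, -0.2, -1.0]
external = [2.0, 1.0, 0.5]
I = internal_stimulus(y, alpha=1.0, beta=0.5)
dy = stm_rate(y, external, alpha=1.0, beta=0.5)
print(I, dy)
```

In this example the two losers each contribute $-\beta$ to the winner's internal stimulus (inhibition), while the winner contributes $+\beta$ to each loser (excitation), which is the behaviour described in the text.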

In [14] we show that the short-term memory of the proposed LTCL equations will
converge stably. We first define the $\Lambda$-orthant for the set of STM states as:

Definition 1 Let $\Lambda$ be a finite subset of integers between 1 and $k$. The $\Lambda$-orthant of
the set of STM states is

$$\{\, y \in \mathbb{R}^k \mid (y_j > 0 \text{ if } j \in \Lambda) \text{ and } (y_j \le 0 \text{ if } j \notin \Lambda) \,\} \qquad (5)$$

From this definition, a number of lemmas are given and we arrive at the following
theorem and corollary:

Theorem 1 There exists a $\Lambda$-orthant of the LTCL’s state space which is an attracting
invariant set of the network’s flow.

Corollary 1 The flow generated by the LTCL equations converges to a fixed point.

This theorem and corollary mean that once an ordered set has been formed among
the output nodes, the short-term memory will converge to a point in the sample data
space. In this paper, we continue to examine the dynamics of the LTCL equations
according to [15].

3 Dynamics of the Lotto-type Competitive Learning Equations

In the competitively inhibited neural network (CINN), it is the sliding threshold ($E > p\beta$,
where $p$ is the number of activated nodes) that determines the number of active
neurons. In lotto-type competitive learning, a similar threshold can be derived to
determine the number of winners. It is assumed that the same “STM initialization” is
valid in LTCL: i.e. (a) the initial LTM states all have unequal external stimuli for a
given input $x_i$, and (b) the initial STM states are all equal and negative. The following
theorem can then be derived.

Theorem 2 Assume that (a) the levels of the initial external stimuli are all unequal,
(b) the STM initialization condition is valid, and (c) there exists a time after which
there are $p$ winners in the network; then the initial stimulus level of the next winning
node must exceed the following threshold:

$$E(x_i, \omega_j(0)) > \alpha + (k - 2p - 1)\beta . \qquad (6)$$

Thus the first winner must have an initial external stimulus exceeding $\alpha + (k-3)\beta$.
The second one must exceed $\alpha + (k-5)\beta$, and so on. Finally, the last winner must
have an initial external stimulus level which exceeds $\alpha$. Clearly, the minimum number
of nodes in any LTCL algorithm must be three. It is easy to verify that the maximum
number of winners is given by the following theorem.




Analyses on the Generalised Lotto-Type Competitive Learning 13 



Theorem 3 The number of winners, $p$, which can be supported by LTCL must satisfy
the following inequality,

$$p \le (k-1)/2 = p_{\max} . \qquad (7)$$
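As a quick arithmetic illustration of the threshold and of the bound on the number of winners, the sketch below evaluates $\alpha + (k - 2p - 1)\beta$ for $p = 1, 2, \ldots$, following the enumeration used in the text (first winner: $\alpha + (k-3)\beta$), and takes the largest admissible $p$ as the integer part of $(k-1)/2$. The function names and the parameter values are assumptions of the example.

```python
def winner_threshold(p, k, alpha, beta):
    """Stimulus level alpha + (k - 2p - 1) * beta that the p-th winner must exceed."""
    return alpha + (k - 2 * p - 1) * beta

def max_winners(k):
    """Largest integer p that satisfies p <= (k - 1) / 2."""
    return (k - 1) // 2

k, alpha, beta = 7, 1.0, 0.5          # illustrative values
p_max = max_winners(k)
thresholds = [winner_threshold(p, k, alpha, beta) for p in range(1, p_max + 1)]
print(p_max, thresholds)
```

For odd `k` the last threshold equals `alpha`, matching the statement that the last winner only needs an initial stimulus above $\alpha$; for `k = 3` a single winner is supported.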

One interesting consequence of the bipolar function in (3) is that the solution to (2) is
given by

$$\omega_j(t) = \omega_j(0) \exp\{-\eta_j f(y_j)\, t\} + \big(1 - \exp\{-\eta_j f(y_j)\, t\}\big)\, x_i . \qquad (8)$$

If the node is a winner, the term inside the exponent will be negative (i.e. $-\eta_j f(y_j) < 0$).
Thus as $t \to \infty$ (and, in general, $\eta_j \ge 0$ by definition), the weight vector $\omega_j \to x_i$.
On the other hand, if the node is a loser, the term inside the exponent will be positive.
The weight vector will not converge to the input.
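The winner and loser behaviour of the LTM solution can be illustrated numerically. The sketch assumes a scalar weight, a fixed input, a constant learning rate and a constant value of the switching function $f = \pm 1$, so that the solution of $\dot{\omega} = f \eta (x - \omega)$ is $\omega(t) = \omega(0) e^{-\eta f t} + (1 - e^{-\eta f t}) x$; the function name and numbers are illustrative.

```python
import math

def ltm_solution(w0, x, eta, f, t):
    """Closed-form weight at time t for constant eta and constant f (+1 or -1)."""
    decay = math.exp(-eta * f * t)
    return w0 * decay + (1.0 - decay) * x

w0, x, eta = 0.0, 1.0, 0.5
winner_gap = abs(ltm_solution(w0, x, eta, +1, 50.0) - x)   # shrinks towards 0
loser_gap = abs(ltm_solution(w0, x, eta, -1, 2.0) - x)     # grows beyond |w0 - x|

# Finite-difference check that the closed form satisfies the LTM equation.
t, h = 1.0, 1e-5
slope = (ltm_solution(w0, x, eta, +1, t + h)
         - ltm_solution(w0, x, eta, +1, t - h)) / (2 * h)
rhs = eta * (x - ltm_solution(w0, x, eta, +1, t))
print(winner_gap, loser_gap, slope, rhs)
```

The distance to the input decays exponentially for a winner and grows exponentially for a loser, which is why a penalised weight vector does not converge to the input.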

The following analysis is adapted from [15]. In that study, Lemmon examines the 
collective movement of all the output nodes within a finite presentation interval. The 
objective is to investigate conditions by which such movement will converge to the 
source density of the input data set. His idea is first to develop equations which 
govern the movement of the output node, i.e. the neural flux $J(\omega)$ (in his terminology).
With these equations, conditions can then be established which guarantee the correct 
convergence of the output nodes to the source density. Since the analysis is 
essentially similar to [15], we will only state the results that are useful in our study. 

Assume that the neural density n(©) is a generalised function in the LTM space. 
One important result in [15] is that the neural flux follows the conservation law, i.e. 

a «(©) / at = - V ■ y(ffl) . (9) 

Now we can present a lemma that relates to the presentation interval, external stimulus 
and neural density. 

Lemma 1 Let x be a given input and define the activation interval, $I(x)$, of x as the
set of winning nodes due to x. The activation interval of x is the interval
$I(x) = (x - \delta, x + \delta)$, where $\delta$ satisfies the following equation,

$E(x, x + \delta) = [\alpha + (k - 1)\beta] - 2k\beta \int_{-\delta}^{\delta} n(x + u)\, du.$ (10)

For convenience, we can define $\zeta = \alpha + (k - 1)\beta$ and $\lambda = 2k\beta$.
Within this interval, all the nodes will be winners and their corresponding weight
vectors will be rewarded. Outside this interval, the nodes will be penalised.

Definition 2 A first order point, $\omega$, of the LTM space is a point where the
neural density is constant over the interval $(\omega - 2\delta, \omega + 2\delta)$.

With these definitions, a theorem can be stated concerning the variable $\delta$.

Theorem 4 If $\omega$ is a first order point and $E(\omega, \omega + \delta)$ is
strictly monotone decreasing in $\delta$, then

$\delta = \frac{\zeta - E(\omega, \omega + \delta)}{2\lambda\, n(\omega)}$ (11)

and

$\frac{\partial \delta}{\partial n(\omega)} = \frac{-2\lambda\delta}{E_\delta + 2\lambda\, n(\omega)},$ (12)

where $E_\delta$ is the first partial derivative of the external stimulus with respect
to $\delta$.
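The two relations of Theorem 4 can be checked in a few lines. The sketch below assumes that the integral in (10) runs over $u \in (-\delta, \delta)$, that $\zeta = \alpha + (k-1)\beta$ and $\lambda = 2k\beta$, and that the neural density is constant over the activation interval (the first order point condition). It is a reading of the argument in [15], not a replacement for it.

```latex
\begin{align*}
E(\omega, \omega+\delta)
  &= \zeta - \lambda \int_{-\delta}^{\delta} n(\omega+u)\,du
   = \zeta - 2\lambda\,\delta\,n(\omega)
  && \text{(constant density)} \\
\Rightarrow\quad \delta
  &= \frac{\zeta - E(\omega, \omega+\delta)}{2\lambda\,n(\omega)} . \\
\intertext{Differentiating the first line with respect to $n(\omega)$, with $\delta$ depending on $n(\omega)$:}
E_\delta\,\frac{\partial \delta}{\partial n(\omega)}
  &= -2\lambda\,\delta - 2\lambda\,n(\omega)\,\frac{\partial \delta}{\partial n(\omega)} \\
\Rightarrow\quad \frac{\partial \delta}{\partial n(\omega)}
  &= \frac{-2\lambda\,\delta}{E_\delta + 2\lambda\,n(\omega)} .
\end{align*}
```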

Without going into the details (see [15] for details), a first approximation for the 
neural flux can be found which is summarised by the following theorem, 

Theorem 5 If $n(\omega)$ is continuously differentiable once, then a first order
approximation of the neural flux is

J{0)) = «(m) (e"- 1) {v/i(fl)) * pM } . (13) 

where $\psi_1(\omega)$ is the first derivative of a box (or rectangular) function (which
is unity between minus one and one and which is zero elsewhere, see [15]), $p(\omega)$
is the probability density function of $\omega$, $p_\omega(\omega) = dp(\omega)/d\omega$,
and $*$ is the convolution operator.

One of the most important works of Lemmon in CINN is to show how the slope of 
the characteristics of neural flux can lead to a set of clustering constraints, which 
restrict the CINN network parameters so that it follows a gradient ascent to a 
smoothed version of the source density [15]. Thus, the following theorem relates to 
the slope of the characteristics, i.e. 



Theorem 6 If $\omega$ is a first order point of the LTM space, then the slope of the
characteristic in $(\omega, t)$-space through this point is



01 = (e"- 1) 



[ (C-£)£, 

[Eg+2Xn 






2{^-E)S^ 
Eg + 2 Xn 




(14) 



where v'o(oi) = fl(of^, and v^i(©) = 5(5^- ©^) V^(fli)- 

Since is approximately equal to (2/3)5Vo [15], we can approximate (14) as 



01 = 1 )^^ 



i^-E}Eg-j 

Eg.^ 

* 8 



d© 



(V'o*P) 



(15) 



Similar to [15], (15) indicates that the characteristics follow the gradient of a
smoothed version of the source density, since the probability density function
$p(\omega)$ is convolved with a version of the box function $\psi_0$. It is clear that
the bounds for E can be derived from (15), which are



e\n(48) < E < t;. 



(16) 
where $\delta$ and $\zeta$ are some positive numbers. Recall that E is a monotone
decreasing function between the input x and the weight vector $\omega$; (16) indicates
that all those nodes with $E < \zeta$ will converge to the source density.

Furthermore, (15) indicates that there is a region of attraction where E is bounded
by (16). Outside this region, it is effectively a repulsive region because the term
within the square bracket of (15) is negative. Thus by (15), for unfair initialization
(as discussed in section 1), only one node will converge to the source density (see
the example given in [13]) if the sensitivity parameter y is not used. The regions of
attraction and repulsion may therefore account for the push-pull effect observed in the
simulation results given in [8-11, 13-14].

These complicated equations indicate that the ensemble of output nodes will
converge towards the cluster centres if they satisfy certain constraints and bounds. If
these are not satisfied, the long-term behaviour of those nodes will not follow the
source density, which accounts for the messy vector traces when parameters are not
carefully selected in [13].

4 Conclusions 

In this paper, the dynamics of lotto-type competitive learning (LTCL) are studied via a
set of LTCL equations. The stability of the short-term memory is easily verified.
Convergence of the winning nodes follows directly from the differential equation of
the update rule. Following the works of Lemmon, the flows of the LTCL equations
are carefully analysed. Again, within a presentation interval, it is shown that the
slope of the characteristics of the winning nodes will follow a smoothed version of the
source density. Comparing (8) and (15), we can clearly see that the winning nodes
will converge to the source density. However, the starting points of the two equations
are different: (8) is a solution to the LTM state equation, and (15) is a consequence of
the flow of the nodes in the network as a whole. The bounds on E suggest that it is
related to the control parameters $\alpha$, $\beta$ and k of the dynamic equations, as
well as the presentation interval parameter $\delta$. Finally, the LTCL equations
provide an alternative neuronal model for studying more complicated neuronal
activities. Future work will concentrate on the capability of the proposed model in
handling exceptional situations and cases where the source densities may be unevenly
distributed.
Abstract. We present a non-hierarchical clustering algorithm that can
determine the optimal number of clusters by using iterations of k-means
and a stopping rule based on BIC. The procedure requires twice the
computation of k-means. However, with no prior information about the
number of clusters, our method is able to get the optimal clusters based
on information theory instead of on a heuristic method.



1 Introduction 

One of the typical methods for non-hierarchical clustering, k-means, is often
used for huge data clustering, as is the self-organizing map [8,9], because it
requires only O(kN) computation for a given number of clusters k and sample
size N. In the context of recent research in data mining, several high-performance
techniques for k-means have been developed [1,6].

The different methods for k-means calculations vary in several aspects. In all
cases, the problem remains that k-means might not converge to a global optimum,
depending on the selection of initial seeds. Moreover, from a data mining
and knowledge discovery perspective, we are convinced that having to predetermine
the number of clusters is a severe restriction.

Indeed, we can obtain an optimal number of clusters heuristically by performing
computations based on different initial settings of cluster numbers. Hardy [2]
surveyed seven typical evaluation criteria (two of them can be applied to
hierarchical clustering methods) with various datasets. However, varying the number of
clusters requires much computation, because we have to use k-means repeatedly.

We propose an algorithm that initially divides data into clusters whose number
is sufficiently small, and continues to divide each cluster into two clusters.
We use BIC (Bayesian Information Criterion [7]) as the division criterion. We will
show that the division method works well, and present an implementation. The
idea was also proposed by [5], but our method differs in the following aspects:

1. Our method can be applied to general p-dimensional datasets.

2. We consider the magnitude of variance and covariance around the centers of 
clusters which can be divided progressively. 

3. We evaluate the number of clusters by means of computer simulation runs. 

Previous research [5] can treat only two-dimensional datasets, and assumes
the variance around the cluster centers to be a constant. As a consequence of
progressive division, the number of elements contained in each cluster
becomes smaller, and the variance will become smaller. Therefore, the magnitude of
variance should be considered.

In section 2, we describe the principle of k-means, and we present the proposed
algorithm in section 3. In section 4, we evaluate the number of final clusters.

2 K-means method 

The procedure of k-means proposed by [4] is as follows:

1. Get the initial k elements in the dataset, and set them as clusters which
consist of one element.

2. Allocate the remaining data to the nearest neighbor cluster centers.

3. Calculate the cluster centers, and regard them as fixed seeds. Repeat once
to allocate all the data to the nearest neighbor cluster seeds.

Most k-means procedures, however, require that the data be allocated
repeatedly until the cluster centers converge.
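The iterated variant mentioned in the last sentence can be sketched as follows. This is a minimal illustration rather than the procedure of [4] or the R routine used later in the paper: the function name `kmeans`, the random choice of initial seeds, and the stopping test (an unchanged allocation) are assumptions made for the example.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Iterated k-means on a list of equal-length tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k data elements serve as initial seeds
    labels = None
    for _ in range(max_iter):
        # Allocate every point to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:  # allocation unchanged, so the centers have converged
            break
        labels = new_labels
        # Recompute each center as the mean of the points allocated to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster became empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, labels = kmeans(pts, k=2)
```

Each pass costs one distance evaluation per point and center, which is the O(kN) computation referred to in the introduction.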

3 X-means 

Pelleg [5] thought of the basic idea for a 2-division procedure and named it
x-means, indicating that the number of clusters with k-means is indefinite. The
algorithm of x-means is quite simple; we begin by dividing data into clusters whose
number is sufficiently small, and continue to divide each cluster into two clusters.

The algorithm proposed in this paper is summarized as follows: 

step 0: Prepare p-dimensional data whose sample size is n.
step 1: Set an initial number of clusters to be $k_0$ (the default is 2), which should
be sufficiently small.

step 2: Apply k-means to all data with setting $k = k_0$. We name the divided
clusters

$C_1, C_2, \ldots, C_{k_0}$.

step 3: Repeat the following procedure from step 4 to step 9 by setting
$i = 1, 2, \ldots, k_0$.

step 4: For a cluster $C_i$, apply k-means by setting $k = 2$. We name the
divided clusters

$C_i^{(1)}, C_i^{(2)}$.

step 5: We assume the following p-dimensional normal distribution for the data
$x_i$ contained in $C_i$:

$f(\theta_i; x) = (2\pi)^{-p/2} |V_i|^{-1/2} \exp\left[ -\frac{1}{2} (x - \mu_i)^t V_i^{-1} (x - \mu_i) \right],$ (1)

then calculate the BIC as

$\mathrm{BIC} = -2 \log L(\hat\theta_i; x_i \in C_i) + q \log n_i,$ (2)
where $\hat\theta_i = [\hat\mu_i, \hat V_i]$ is the maximum likelihood estimate of the
p-dimensional normal distribution; $\mu_i$ is the p-dimensional mean vector, and $V_i$
is the $p \times p$ variance-covariance matrix; q is the number of parameters,
and it becomes 2p if we ignore the covariance of $V_i$. $x_i$ is the
p-dimensional data contained in $C_i$; $n_i$ is the number of elements contained
in $C_i$. L is the likelihood function, $L(\cdot) = \prod f(\cdot)$.
We choose to ignore the covariance of $V_i$.
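step 6: We assume p-dimensional normal distributions with parameters
$\theta_i^{(1)}, \theta_i^{(2)}$ for $C_i^{(1)}, C_i^{(2)}$ respectively; the
probability density function of this 2-division model becomes

$g(\theta_i^{(1)}, \theta_i^{(2)}; x) = \alpha_i [f(\theta_i^{(1)}; x)]^{\delta_i} [f(\theta_i^{(2)}; x)]^{1 - \delta_i},$ (3)

where

$\delta_i = 1$ if x is included in $C_i^{(1)}$, and $\delta_i = 0$ if x is included in $C_i^{(2)}$; (4)

$x_i$ will be included in either $C_i^{(1)}$ or $C_i^{(2)}$; $\alpha_i$ is a constant
which lets equation (3) be a probability density function
($1/2 \le \alpha_i \le 1$). If an exact value is wanted, we can use p-dimensional
numerical integration. But this requires much computation. Thus, we approximate
$\alpha_i$ as follows:

$\alpha_i = 0.5 / K(\beta_i),$ (5)

where $\beta_i$ is a normalized distance between the two clusters, shown by

$\beta_i = \sqrt{ \frac{\| \mu_i^{(1)} - \mu_i^{(2)} \|^2}{|V_i^{(1)}| + |V_i^{(2)}|} }.$ (6)

$K(\cdot)$ stands for the lower probability of the normal distribution.

When we set $\beta_i = 0, 1, 2, 3$, $\alpha_i$ becomes 0.5/0.500 = 1, 0.5/0.841 = 0.59,
0.5/0.977 = 0.51, 0.5/0.998 = 0.50 respectively.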

The BIC for this model is

$\mathrm{BIC}' = -2 \log L'(\hat\theta_i'; x_i \in C_i) + q' \log n_i,$ (7)

where $\hat\theta_i' = [\hat\theta_i^{(1)}, \hat\theta_i^{(2)}]$ is a maximum likelihood
estimate of two p-dimensional normal distributions; since there are two parameters of
mean and variance for each of the p variables, the number of parameters becomes
$q' = 2 \times 2p = 4p$. $L'$ is the likelihood function, $L'(\cdot) = \prod g(\cdot)$.
step 7: If BIC > BIC', we prefer the two-divided model, and decide to continue
the division; we set

$C_i \leftarrow C_i^{(1)}$.

As for $C_i^{(2)}$, we push the p-dimensional data, the cluster centers, the log
likelihood and the BIC onto the stack. Return to step 4.
step 8: If BIC $\le$ BIC', we prefer not to divide clusters any more, and decide to
stop.

Extract the stacked data which is stored in step 7, and set

$C_i \leftarrow C_i^{(2)}$.

Return to step 4. If the stack is empty, go to step 9.
step 9: The 2-division procedure for $C_i$ is completed. We renumber the cluster
identification such that it becomes unique in $C_i$.
step 10: The 2-division procedure for the initial divided clusters is completed.

We renumber all cluster identifications such that they become unique.
step 11: Output the cluster identification number to which each element is
allocated, the center of each cluster, the log likelihood of each cluster, and
the number of elements in each cluster. [stop]
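The division test of steps 5 to 8 can be sketched as below for a cluster that has already been divided in two. This is an illustrative reading, not the authors' program: covariances are ignored (q = 2p, q' = 4p) as chosen in step 5, the expression used for the normalized distance beta is an assumption (distance between centers over the square root of the summed covariance determinants), every variance is assumed to be non-zero, and all function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Lower probability K(z) of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fit_diagonal_normal(points):
    """Maximum likelihood mean and per-coordinate variance (covariances ignored)."""
    n = len(points)
    mean = [sum(col) / n for col in zip(*points)]
    var = [sum((x - m) ** 2 for x in col) / n for col, m in zip(zip(*points), mean)]
    return mean, var


def log_density(x, mean, var):
    """Log of the p-dimensional normal density with a diagonal covariance matrix."""
    return sum(-0.5 * math.log(2.0 * math.pi * v) - (a - m) ** 2 / (2.0 * v)
               for a, m, v in zip(x, mean, var))


def bic_single(points):
    """BIC of one normal distribution fitted to the undivided cluster (q = 2p)."""
    mean, var = fit_diagonal_normal(points)
    loglik = sum(log_density(x, mean, var) for x in points)
    return -2.0 * loglik + 2 * len(mean) * math.log(len(points))


def bic_split(part1, part2):
    """BIC' of the 2-division model (q' = 4p) with alpha approximated as 0.5 / K(beta)."""
    (m1, v1), (m2, v2) = fit_diagonal_normal(part1), fit_diagonal_normal(part2)
    # Assumed form of beta: center distance scaled by the summed determinants,
    # which for diagonal matrices are the products of the variances.
    dist2 = sum((a - b) ** 2 for a, b in zip(m1, m2))
    beta = math.sqrt(dist2 / (math.prod(v1) + math.prod(v2)))
    alpha = 0.5 / normal_cdf(beta)
    n = len(part1) + len(part2)
    loglik = n * math.log(alpha)
    loglik += sum(log_density(x, m1, v1) for x in part1)
    loglik += sum(log_density(x, m2, v2) for x in part2)
    return -2.0 * loglik + 4 * len(m1) * math.log(n)


def should_split(part1, part2):
    """Continue the division only when BIC > BIC'."""
    return bic_single(part1 + part2) > bic_split(part1, part2)
```

Here `part1` and `part2` are lists of coordinate tuples, for example the two sub-clusters returned by a k-means call with k = 2. The value `0.5 / normal_cdf(1.0)` is about 0.59, in line with the figures quoted in step 6.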

The reasons why we choose BIC over other common information criteria for
model selection are as follows:

- BIC considers the selection among models from the exponential family of
distributions.

- BIC is based on prior probability rather than the distance between two
distributions.

4 Evaluation of the performance 

4.1 An investigation of the number of generated clusters

A simulation procedure is adopted. It generates 250 two-dimensional normal
variables; these random variables should be clustered into 5 groups. Each group
consists of 50 elements:

$x_j \sim N(\mu = [0, 0], \sigma = [0.2, 0.2]), (j = 1, \ldots, 50)$

$x_j \sim N(\mu = [-2, 0], \sigma = [0.3, 0.3]), (j = 51, \ldots, 100)$

$x_j \sim N(\mu = [2, 0], \sigma = [0.3, 0.3]), (j = 101, \ldots, 150)$

$x_j \sim N(\mu = [0, 2], \sigma = [0.4, 0.4]), (j = 151, \ldots, 200)$

$x_j \sim N(\mu = [0, -2], \sigma = [0.4, 0.4]), (j = 201, \ldots, 250)$

where $\mu$ is a mean, and $\sigma^2$ is a variance. We set $k_0 = 2$ as an initial
division, and performed 1,000 simulation runs of x-means. Two-dimensional normal
variables are generated for each simulation run. X-means will call k-means repeatedly;
the algorithm of k-means is based on [3], which is provided in R.
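One run of this simulated data can be generated as in the following sketch. It treats the listed spread values as standard deviations of independent coordinates, which is an assumption about the notation; the names `GROUPS` and `generate_dataset` are hypothetical.

```python
import random

# (mean, spread) for each of the five groups listed above. The spread is used
# as a standard deviation here, which is an assumption about the notation.
GROUPS = [
    ((0.0, 0.0), 0.2),
    ((-2.0, 0.0), 0.3),
    ((2.0, 0.0), 0.3),
    ((0.0, 2.0), 0.4),
    ((0.0, -2.0), 0.4),
]


def generate_dataset(rng, per_group=50):
    """Draw per_group independent two-dimensional normal points for every group."""
    data = []
    for (mx, my), sd in GROUPS:
        data.extend((rng.gauss(mx, sd), rng.gauss(my, sd)) for _ in range(per_group))
    return data


data = generate_dataset(random.Random(0))  # one simulation run of 250 points
```

Repeating the call with a fresh random state for each run gives the 1,000 independent datasets used in the experiment.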

Table 1 summarizes the number of clusters generated by x-means (upper
row). For 1,000 simulation runs, the most frequent case is when 5 clusters are
generated; this occurs 533 times. The second most frequent case is 6 clusters,
which occurs 317 times. The middle row shows the results of applying AIC (Akaike's
Information Criterion) instead of BIC to x-means. We found that x-means by
AIC tends to overgenerate clusters. The bottom row in Table 1 shows the number
of optimal clusters when the goodness of model for given data is maximum (i.e.,
the AIC for given data is minimum) by varying k applied to k-means. This
distribution is very similar to the distribution in the upper row.

Table 1. The number of clusters by using 250 random variables of two-dimensional
normal distribution

number of clusters    2   3   4    5    6    7   8   9  10  11  12  13  14  total
x-means (BIC)         2   6   ?  469  383   99  27   ?   ?   ?   0   ?   ?  1,000
x-means (AIC)         2   1   1  322  295  162  93  54  36  17  11   2   4  1,000
heuristic method      0   2  37  559  265   90  35   8   4   0   0   0   0  1,000

The cluster centers found by k-means are not always located where the elements
cohere; thus x-means often divides a cluster into two clusters until the new
cluster centers converge to where the elements cohere. Consequently, x-means
produces rather more clusters than adequate. Actually in our simulation, when
x-means divides all 250 (= 50 x 5) data into two clusters equally (i.e., 125
elements each), both subclusters are often divided into three clusters (50 + 50 + 25),
resulting in 6 clusters.

4.2 An investigation of the number of cluster elements

After applying x-means to the simulation runs, we can obtain the distributions
of the number of cluster elements, as shown in Fig. 1. The horizontal axis gives
the cluster identification number, which is sorted in increasing order by the
number of cluster elements; the vertical axis gives the distribution of the number of
cluster elements; box-and-whisker charts are used.

A box-and-whisker chart contains a box bounded by two hinges, two whiskers,
and outlier(s) if any; the lower and upper hinges show the 25th and 75th percentiles
of the distribution; the median (50th percentile) lies between the two hinges. The two
whiskers stand for the tails of the distribution; the whiskers extend to the most
extreme data point which is no more than 1.5 times the interquartile range from the
box; the outlier(s) may be shown if any.
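The quantities drawn in one box-and-whisker chart can be computed as in the sketch below. Quartile conventions differ between packages; the inclusive quartile method of Python's `statistics` module is used here as an assumption and may differ slightly from the hinges computed by the plotting software behind Fig. 1. The function name `box_summary` is hypothetical.

```python
import statistics


def box_summary(values):
    """Return hinges, median, whisker ends and outliers for one box-and-whisker chart."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    reach = 1.5 * (q3 - q1)  # whiskers may extend at most this far from the box
    inside = [v for v in data if q1 - reach <= v <= q3 + reach]
    return {
        "lower_hinge": q1,
        "median": median,
        "upper_hinge": q3,
        "whiskers": (min(inside), max(inside)),
        "outliers": [v for v in data if v < q1 - reach or v > q3 + reach],
    }


# Hypothetical cluster sizes from a handful of runs, with one unusually large cluster.
summary = box_summary([48, 49, 50, 50, 51, 52, 90])
```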

In case (a), i.e., when obtaining 5 clusters, we found that each cluster consists
of about 50 elements. In case (b), obtaining 6 clusters, 4 clusters consist of about
50 elements and the remainder is divided into 2 clusters. Case (c), obtaining
7 clusters, is similar to (b); 3 clusters consist of about 50 elements and the
remainder is divided into 4 clusters. For cases (b), (c), and (d), the proper
division into clusters of 50 was performed, although some of the generated clusters
may be rather small.

4.3 Consideration of the computational amount

X-means has to find the k final clusters, even though it proceeds by repeated
division into two clusters. In addition, we need to judge whether each of these k
final clusters should be divided any further. Thus, remembering that k-means requires
O(kN) computation, x-means will take about twice as much computation as k-means.
[Figure: box-and-whisker charts in four panels, (a) 5 clusters, (b) 6 clusters,
(c) 7 clusters, (d) 8 clusters; horizontal axis: cluster number]

Fig. 1. Distribution of the number of cluster elements contained in final clusters



Indeed the computation of BIC is needed, but we can ignore this because we
calculate it only once after fixing the cluster elements. The BIC can be easily
obtained from the mean and variance-covariance of a p-dimensional normal
distribution. We are convinced that x-means gives us a quite good solution which
is commensurate with its computational expense, although the solution may not be an
optimum. This program can be obtained via
http://www.rd.dnc.ac.jp/~tunenori/src/xmeans.prog.
Abstract. This paper presents the decision clusters classifier (DCC) for
database mining. A DCC model consists of a small set of decision clusters
extracted from a tree of clusters generated by a clustering algorithm
from the training data set. A decision cluster is associated with one of the
classes in the data set and used to determine the class of new objects.
A DCC model classifies new objects by deciding which decision clusters
these objects belong to. In making classification decisions, DCC is similar
to the k-nearest neighbor classification scheme but its model building
process is different. In this paper, we describe an interactive approach to
building DCC models by stepwise clustering the training data set and
validating the clusters using data visualization techniques. Our initial
results on some public benchmarking data sets have shown that DCC
models outperform some existing popular classification methods.



1 Introduction 

We consider a finite set of n objects $X = \{x_1, x_2, \ldots, x_n\}$ together with m
attributes describing the properties of these objects. The data consist of n
m-dimensional feature vectors in a sample space V. Most clustering methods
formalize the intuitive ideas of a "cluster" of objects, which is typically defined as
a subset $C \subset X$ such that the objects in C fulfill certain criteria of
homogeneity that are directly verified in the data set. For instance, all pairwise
dissimilarities are less than a certain value. To build a classification model from X,
it is required that the objects must first be labeled in classes with certain
regularities. In a spatial context, the basic regularity is that objects close to each
other must reflect a "similar" behavior and tend to have the same class. This is the
proposition of the k-nearest neighbor classifier (KNN). In relation to clustering,
objects in the same cluster tend to have the same class. As such, classification can
be viewed as a clustering problem that can be solved with a clustering process. This
is also the motivation of our work.

In this paper, we present a decision clusters classifier (DCC) for database
mining. A DCC model is defined as a set of p decision clusters generated with a
clustering algorithm from the training data set. A decision cluster is labeled with
one of the classes in the data. The DCC model classifies new objects by deciding to
which decision clusters these objects belong. In making classification decisions,
DCC is very similar to KNN. However, their model building processes are
different. We use the interactive clustering process [5] to build DCC models. Given
a training data set, we first interactively partition it into several clusters and
build a cluster tree from them. Then, we select a subset of clusters from the tree
as a candidate DCC model and use a tuning data set to validate the model. In
building the cluster tree, we employ different visualization techniques, including
FASTMAP projection [3], to visually validate clusters generated at each step by
the k-prototypes clustering algorithm [4] which we have proposed and used. In
building a DCC model, we consider an additional measure to validate clusters.
We require the objects in a cluster to be dominated by one class as much as
possible. In [8], Mui and Fu presented a binary tree classifier for classification
of nucleated blood cells. Each terminal node of the tree is a cluster dominated
by a particular group of blood cell classes. To partition a nonterminal node,
class groupings were first decided by projecting the objects onto a 2-dimensional
space using principal component analysis (PCA) and visually examining how
to separate the classes. Then, the decision rule of a quadratic classifier was
repeatedly used to test different combinations of features and to select the "best"
subset of feature vectors that can be used to partition the objects into the two
clusters. Therefore, different feature subsets were used at different non-terminal
nodes of the tree to partition data. This work was later advanced by the use
of the k-means algorithm to generate clusters at each non-terminal node and
determine the grouping of classes [7].

Mui and Fu's work [8] was an early example of using the interactive approach
to building classification models. The main reason was the limited computing
power available in the early 1980s. Although the study of algorithms
for building classification models has focused on automatic approaches, the
interactive approach has recently been brought to attention again [1] with the
enhancement of sophisticated visualization techniques. The great advantage
of the interactive approach is that human knowledge can be used to guide the
model building process.

The paper is organized as follows. In Section 2, we describe an interactive 
approach to building DCC models. In Section 3, some experimental results are 
given to illustrate the effectiveness of DCC models. Finally, some concluding 
remarks are given in Section 4. 

2 Construction of DCC Models 

In this section, we describe an interactive approach to building DCC models 
from training data sets. A cluster tree represents a set of hierarchical clusterings 
of a data set. Before we discuss our top-down approach to building a cluster tree, 
we start with the following definitions. 

Definition 1. An m-clustering of X is a partition of X into m subsets (clusters),
which satisfies:

$C_i \ne \emptyset$, $i = 1, \ldots, m$, $\bigcup_{i=1}^{m} C_i = X$, and $C_i \cap C_j = \emptyset$, $i \ne j$, $i, j = 1, \ldots, m$.

Definition 2. A clustering S with k clusters is said to be nested in the clustering
T, which contains r (< k) clusters, if for any cluster $C_i$ in S, there is a cluster
$C_j$ in T such that $C_i \subseteq C_j$, and there exists at least one cluster in S
for which $C_i \subseteq C_j$ and $C_i \ne C_j$.

Definition 3. A cluster tree is a sequence of nested clusterings, so that for any
i, j with i < j and for any $C_j \in S_j$, there is $C_i \in S_i$ such that $C_j \subseteq C_i$.
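Definitions 1 and 2 translate directly into set operations. The sketch below represents each cluster as a Python set; the function names `is_clustering` and `is_nested` are hypothetical and serve only to make the definitions concrete.

```python
def is_clustering(clusters, universe):
    """Definition 1: non-empty, pairwise disjoint subsets whose union is the universe."""
    if any(len(c) == 0 for c in clusters):
        return False
    union = set().union(*clusters)
    total = sum(len(c) for c in clusters)  # equals len(union) only if clusters are disjoint
    return union == set(universe) and total == len(union)


def is_nested(fine, coarse):
    """Definition 2: every fine cluster lies in a coarse one, at least one of them properly."""
    if len(coarse) >= len(fine):  # the coarse clustering must contain fewer clusters
        return False
    contained = all(any(c <= d for d in coarse) for c in fine)
    proper = any(any(c < d for d in coarse) for c in fine)
    return contained and proper


X = {1, 2, 3, 4, 5}
T = [{1, 2, 3}, {4, 5}]
S = [{1, 2}, {3}, {4, 5}]
```

In this example S is nested in T, so the sequence (T, S) forms a two-level cluster tree in the sense of Definition 3.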

We remark that a clustering is a DCC model (as defined in Section 1) if
the clusters in the clustering have dominant classes. Therefore, a cluster tree
represents a set of DCC models.

Building a cluster tree from a training data set is to find a sequence of nested 
clusterings in the data set. We can use a top-down approach to generating a 
clustering by recursively applying a clustering algorithm. Similar to the process 
of the construction of a decision tree [8], we start from the whole data set and 
partition it into k clusters. Then for each cluster, we can further partition it into 
k' sub-clusters. This process is repeated and a tree of clusters grows until all 
leaves of the tree are found. 

At each node of the tree, we need to decide whether to further partition it into
sub-clusters or not and how. This is equivalent to deciding the terminal nodes and
the best splitting in decision trees. In fact, our cluster tree is a kind of decision
tree, although we do not use it to make classification decisions. We determine a
cluster as a terminal node based on two conditions: (i) its objects are dominated
by one class and (ii) it is a natural cluster in the object space. Condition (i),
which is widely used in many decision tree algorithms, is determined based on
the frequencies of classes in the cluster. If no clear dominant class exists, the
cluster will be further partitioned into sub-clusters.

If a cluster with the dominant class is found, we do not simply determine it 
as a terminal node. Instead, we investigate whether the cluster is a natural one 
or not by looking into its compactness and isolation [6]. To do so, we project the 
objects in the cluster onto a 2-dimensional (2D) space and visually examine the 
distribution of the objects. If the distribution shows more than one sub-cluster, 
then we use the clustering algorithm to find these sub-clusters. Otherwise, we 
identify the cluster as a terminal node with a dominant class. 
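Condition (i) can be made concrete with a simple frequency count. In the sketch below the 0.9 threshold for calling a class dominant is a hypothetical value chosen for illustration; the paper leaves this judgment to the analyst. Condition (ii) is assessed visually and is not covered here.

```python
from collections import Counter


def dominant_class(labels, threshold=0.9):
    """Return (class, share) if one class reaches the threshold share, else (None, share)."""
    label, count = Counter(labels).most_common(1)[0]
    share = count / len(labels)
    return (label if share >= threshold else None), share


# A cluster of ten objects, nine of which belong to class "a".
print(dominant_class(["a"] * 9 + ["b"]))
```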

The Fastmap algorithm [3] is used for projecting objects onto the 2D space
because it can deal with categorical data. Given a cluster, the 2D projection
allows us to visually identify whether sub-clusters exist in it. If we see any 
separate clusters in the 2D projection, we can conclude that sub-clusters exist 
in the original object space. However, if there are no separate clusters on the 
display, we do not simply conclude that the cluster is a natural cluster. Instead, 
we visualize the distribution of the distances between objects and the cluster 
center. This visual information further tells us how compact the cluster is. 

In our work, we use the k-prototypes algorithm [4] to cluster data because
of its efficiency and capability of processing both numeric and categorical data.
To partition a cluster into sub-clusters with k-prototypes, we need to specify k,
the number of clusters to be generated. Here, we take advantage of the Fastmap
projection to assist the selection of k. By projecting the objects in the cluster
onto a 2D space, we visualize objects of different classes in different symbols or
colors. We can examine the potential number of clusters and the distribution
of object classes in different clusters. Therefore, in determining k, we consider
not only the number of potential clusters but also the number and distribution
of classes in the cluster.

Let X denote the training data set, & the k-prototypes clustering algorithm
and F the Fastmap algorithm. We summarize the interactive process of building
a cluster tree as follows.

1. Begin; Set X as the root of the cluster tree. Select the root as the current node Sc. 

2. Use F to project Sc onto 2D. Visually examine the projection to decide k, the number of 
potential clusters. 

3. Apply & to partition Sc into k clusters. 

4. Use F and other visual methods to validate the partition. (The tuning data set can also be used 
here to test the increase of classification accuracy of the new clustering.) 

5. If the partition is accepted, go to step 6, otherwise, select a new k and go to step 3. 

6. Attach the clusters as the children of the partitioned node. Select one as the current node Sc. 

7. Validate Sc to determine whether it is a terminal node or not. 

8. If it is not a terminal node, go to step 2. If it is a terminal node, but not the last one, select 
another node as the current node Sc, which has not been validated, and go to step 7. If it is 
the last terminal node in the tree, stop. 



After we build a cluster tree from the training data set using the above
process, we have created a sequence of clusterings. In principle, each clustering
is a DCC model. Their classification performances are different. Therefore, we
use the training data set (or the tuning data set) to identify the best DCC
model from a cluster tree. We start from a top level clustering. First, we select
all clusters of the top level clustering as decision clusters, use them to classify the
training data set and calculate the classification accuracy. Then we identify the
decision clusters which have classified more objects wrongly than other clusters.
We replace these clusters with their sub-clusters in the lower level clustering and
test the model again. We continue this process until the best DCC model is
found.
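The accuracy calculation used in this selection loop can be sketched as follows. The sketch assumes numeric attributes and assigns an object to the decision cluster with the nearest center under squared Euclidean distance; this is a simplification, since the paper's models are built with k-prototypes and can also handle categorical attributes. The names `classify` and `accuracy` and the example data are hypothetical.

```python
def classify(obj, decision_clusters):
    """Assign obj the class of the decision cluster whose center is nearest."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(obj, center))

    _, label = min(decision_clusters, key=lambda dc: sq_dist(dc[0]))
    return label


def accuracy(objects, labels, decision_clusters):
    """Share of objects whose predicted class matches the given label."""
    hits = sum(classify(o, decision_clusters) == y for o, y in zip(objects, labels))
    return hits / len(objects)


# A candidate model: (center, dominant class) pairs, both hypothetical.
model = [((0.0, 0.0), "neg"), ((5.0, 5.0), "pos")]
print(accuracy([(0, 1), (4, 5), (1, 0)], ["neg", "pos", "pos"], model))
```

Comparing this accuracy before and after replacing a poorly performing decision cluster by its sub-clusters gives the test described above.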

Each level of clustering in the cluster tree is a partition of the training data
set. However, our final DCC model is not necessarily a partition. In the
final DCC model, we often drop certain clusters from a clustering. These clusters
contain few objects in several classes. These are the objects which are located on
the boundaries of other clusters. From our experiment, we found that dropping
these clusters from the model can increase the classification accuracy.

3 Experimental Results 

We have implemented a prototype system, called VC+, in Java to facilitate
the interactive process of building DCC models. VC+ is composed of three major
components: a tree mechanism that maintains the tree of clusters gradually
growing during the clustering process, a data mining engine based on the
k-prototypes algorithm, and data projection based on Fastmap. VC+ also contains
a number of visual functions, which are used to visualize the characteristics
of clusters, cluster compactness and relationships as well as the distribution of
classes in clusters. These functions can be applied to any single cluster or group
of clusters selected by the user from the tree diagram. VC+ can also be used to
solve both clustering and classification problems [5].

In our initial experiments, we tested our DCC models against four public data 
sets chosen from the UCI machine learning data repository [2] and compared our 
results with the results of Quinlan's C5.0 decision tree algorithm, Discrim (a
statistical classifier developed by R. Henery), Bayes (a statistical classifier which 
is a part of IND package from NASA’s COSMIC center) and KNN (a statistical 
classifier, developed by C. Taylor). The characteristics of the four data sets 
are listed in Table 1. Size, data complexity and classification difficulties were 
the major considerations in choosing these data sets. The Heart and Credit 
Card data sets contain both numeric and categorical attributes. The Heart and 
Diabetes data sets are among those that are difficult to classify (low classification 
accuracy) [2]. The training and test partitions of these public data sets were taken 
directly from the data source. In conducting these experiments with VC+, we 
used training data sets to build cluster trees and used the test sets to select the 
DCC models. In growing the cluster trees, the training data sets were used as 
the tuning data to determine the partitions of nodes, together with other visual 
examinations such as Fastmap projection. We used the Clementine Data Mining 
System to build the C5.0 and boosted C5.0 models from the same training data 
sets and tested these models with test data sets. To create better C5.0 models, 
we used the test data sets to fine-tune the parameters of the algorithm. As such, 
both our models and C5.0 models were optimistic and comparable. 

Table 2 shows the comparison results. The classification results for these
four data sets using Discrim, Bayes and KNN are listed in [9]. For the Heart
data, the best test results from VC+ and C5.0 are the same. Because
the Heart data set was very small, it was easy for these models to obtain the
best result from the training and test partition. For the Credit Card data, our
model was equivalent to the Boosted C5.0 model. For the Diabetes and Satellite
Image data, our models produced better results than the C5.0 and Boosted C5.0
models. In particular, the accuracy increase on the Diabetes data was significant
(81.74% on the test set, against 76.09% for C5.0 and 73.91% for Boosted C5.0).
These results demonstrate that our clustering approach is more suitable for
numeric data than decision tree algorithms like C5.0. The low accuracy of
Boosted C5.0 on the Diabetes data could be caused by noise in the data.

4 Concluding Remarks 

We have presented a new classifier, the DCC model, for database mining and
a new approach to interactively building DCC models by clustering and cluster
validation. Our initial experimental results on four public domain data sets have
shown that our models could outperform the popular C5.0 models. Our interactive
approach, facilitated by a simple visual tool with an efficient clustering
algorithm and a few visual functions for cluster validation, can
easily achieve a near-optimal solution. Because a DCC model is simply a set of
clusters, it is easy to interpret and understand. It is also straightforward
to deploy a DCC model in an enterprise data warehouse environment, since
the only task is to implement the distance function used and the set of cluster
center records. The model is also efficient in classifying new data because the
number of cluster center records is usually small. A special feature of our
interactive approach is that we use the same process and functionality to solve
both clustering and classification problems [5]. Such integration will be a great
benefit to business users because they do not need to worry about the selection
of different algorithms. Instead, they can focus on data and business solutions.
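
To illustrate how little is needed for deployment, the following minimal Python sketch classifies a record by its nearest cluster center. It rests on assumptions that are not stated in this section: each center record is taken to carry the dominant class label of its cluster, and the distance is taken to be a k-prototypes style mixed measure (squared Euclidean distance on numeric fields plus a weight times the number of categorical mismatches). The names `mixed_distance`, `classify` and `gamma` are hypothetical.

```python
def mixed_distance(record, center, numeric_idx, categorical_idx, gamma=1.0):
    """Squared Euclidean distance on numeric fields plus gamma times the
    number of mismatching categorical fields (assumed mixed measure)."""
    numeric = sum((record[i] - center[i]) ** 2 for i in numeric_idx)
    mismatches = sum(record[i] != center[i] for i in categorical_idx)
    return numeric + gamma * mismatches


def classify(record, centers, numeric_idx, categorical_idx, gamma=1.0):
    """Return the class label of the nearest center.

    centers is a list of (center_record, class_label) pairs; attaching a
    label to each center is an assumption of this sketch.
    """
    _, label = min(
        centers,
        key=lambda c: mixed_distance(record, c[0], numeric_idx,
                                     categorical_idx, gamma),
    )
    return label
```

Under these assumptions, classifying one record costs one distance evaluation per cluster center, which is why a small number of center records keeps classification fast.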





Name of data sets     Heart   Credit Card   Diabetes   Satellite Image
Training records        189           440        537              4435
Test records             81           213        230              2000
Categorical fields        6             9          0                 0
Numerical fields          7             6          8                36
Number of classes         2             2          2                 6

Table 1: Test data sets



Name of data sets   Heart                       Credit Card                 Diabetes                    Satellite Image
                    train         test          train         test          train         test          train         test
VC+                 90.49         87.65         90.23         87.79         83.99         81.74         91.64         91.60
C5.0                96.30         87.65         90.00         84.98         80.26         76.09         98.84         85.90
Boosted C5.0        98.94         87.65         99.09         87.32         96.65         73.91         99.95         90.45
Discrim             68.50         60.70         85.10         85.90         78.00         77.50         85.10         82.90
Bayes               64.90         62.60         86.40         84.90         76.10         73.80         69.20         72.30
KNN                 [illegible]   52.20         [illegible]   [illegible]   [illegible]   67.60         91.10         90.60

Table 2: Classification accuracy (in terms of %) for different methods.
Abstract. This paper presents a new distributed data clustering algorithm,
which operates successfully on huge data sets. The algorithm is
designed based on a classical clustering algorithm, called PAM [8, 9], and
a spanning tree-based clustering algorithm, called Clusterize [3]. It
outperforms its counterparts both in clustering quality and execution time.
The algorithm also better utilizes the computing resources associated
with the clusterization process. The algorithm operates in linear time.



1 Introduction 

With the ever-increasing growth in size and number of available databases,
mining knowledge, regularities or high-level information from data becomes essential
to support decision-making and predict future behavior [2,4-6]. Data mining
techniques can be classified into the following categories: classification,
clustering, association rules, sequential patterns, time-series patterns, link analysis and
text mining [2,5,11]. Due to its undirected nature, clustering is often the best
technique to adopt first when a large, complex data set with many variables
and many internal structures is encountered. Clustering is a process whereby
a set of objects is divided into several clusters in which each of the members is
in some way similar and is different from the members of other clusters [8-10,
12]. The most distinctive characteristic of clustering analysis is that it often
encounters very large data sets, containing millions of objects described by tens
or even hundreds of attributes of various types (e.g., interval-scaled, binary,
ordinal, categorical, etc.). This requires that a clustering algorithm be scalable
and capable of handling different attribute types. However, most classical
clustering algorithms either can handle various attribute types but are not efficient
when clustering large data sets (e.g., the PAM algorithm [8,9]) or can handle
large data sets efficiently but are limited to interval-scaled attributes (e.g., the
k-means algorithm [1,2,7,10]).

In this context, several fast clustering algorithms have been proposed in the
literature. One of them is CLARA [8], which is designed based on a sampling
approach and a classical clustering algorithm, called PAM [8, 9]. Instead of
finding medoids, each of which is the most centrally located object in a cluster,
for the entire data set, CLARA draws a sample from the data set and uses the
PAM algorithm to select an optimal set of medoids from the sample. To alleviate
sampling bias, CLARA repeats the sampling and clustering process multiple
times and, subsequently, selects the best set of medoids as the final clustering.

Since CLARA adopts a sampling approach, the quality of its clustering results
depends greatly on the size of the sample. When the sample size is small,
CLARA's efficiency in clustering large data sets comes at the cost of clustering
quality. To overcome this, this paper presents a distributed k-medoid clustering
algorithm, designed by utilizing the classical PAM and a weighted minimal
spanning tree based clustering algorithm. The algorithm aims to offer better
clustering quality and execution time through economic utilization of the computing
resources. It is scalable and operates in linear time.

Though the search strategies employed by the two clustering algorithms
(i.e. CLARA and the proposed one) are fundamentally the same, the proposed
algorithm improves significantly on CLARA in terms of clustering quality and
execution time when clustering large data sets.

The remainder of the paper is organized as follows. Section 2 reviews
CLARA and Clusterize. Section 3 details the design of the proposed distributed
clustering algorithm. A set of experiments based on synthetic data sets with
pre-specified data characteristics was conducted on both algorithms (i.e. CLARA
and the proposed one), and the results are summarized in Section 4. Finally, the
contributions of the paper are presented in Section 5.

2 Review 

This section briefly reviews CLARA and Clusterize.

2.1 CLARA (Clustering LARge Applications) 

It relies on the sampling approach to handle large data sets [8]. Instead of
finding medoids for the entire data set, CLARA draws a small sample from the
data set and applies the PAM algorithm to generate an optimal set of medoids
for the sample. The quality of the resulting medoids is measured by the average
dissimilarity between every object in the entire data set D and the medoid of its
cluster, defined as the following cost function:

$$Cost(M, D) = \frac{1}{N} \sum_{i=1}^{k} \sum_{v \in C_i} d(v, m_i)$$

where $M = \{m_1, m_2, \ldots, m_k\}$, k is the cardinality of the set M and
$C_i = \{v \mid d(v, m_i) \le d(v, m_j) \ \forall j \ne i\}$. Here, N is the total number of data points
and $D = \{v_1, v_2, \ldots, v_n\}$, where n is the cardinality of the set D.

To alleviate sampling bias, CLARA repeats the sampling and clustering process
a pre-defined number of times and subsequently selects as the final clustering
result the set of medoids with the minimal cost. Assume q to be the number of
samplings. The CLARA algorithm is presented next:

Set mincost to a large number;
Repeat q times
    Create S by drawing s objects randomly from D;
    Generate set of medoids M from S by applying PAM;
    If Cost(M, D) < mincost then
        mincost = Cost(M, D);
        bestset = M;
    End-if;
End-repeat;
Return bestset;

For k medoids, the complexity of the above algorithm is $O((s^2 k \cdot R + Nk)q)$,
where R is the number of iterations required. As found in [12], for a
sample size of 40 + 2k and for q = 5, the order of complexity of CLARA
will be $O(Nkq)$ (as k is negligibly small compared with N), which is linear in order.
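
The sampling loop and the cost function can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the reference implementation: the helper `pam` is a simplified greedy swap search standing in for the full PAM algorithm, and the names `cost`, `pam`, `clara`, `dist` and `seed` are hypothetical.

```python
import random


def cost(medoids, data, dist):
    """Average dissimilarity between every object and its closest medoid."""
    return sum(min(dist(v, m) for m in medoids) for v in data) / len(data)


def pam(sample, k, dist):
    """Simplified stand-in for PAM: start from the first k objects and apply
    medoid/non-medoid swaps for as long as they lower the cost on the sample."""
    medoids = list(sample[:k])
    best = cost(medoids, sample, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in sample:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                trial_cost = cost(trial, sample, dist)
                if trial_cost < best:
                    medoids, best, improved = trial, trial_cost, True
    return medoids


def clara(data, k, dist, s, q, seed=0):
    """Repeat sampling and PAM q times; keep the medoids with minimal
    cost measured on the whole data set."""
    rng = random.Random(seed)
    min_cost, best_set = float("inf"), None
    for _ in range(q):
        sample = rng.sample(data, s)        # draw s objects randomly from D
        medoids = pam(sample, k, dist)      # medoids of the sample
        c = cost(medoids, data, dist)       # evaluated against all of D
        if c < min_cost:
            min_cost, best_set = c, medoids
    return best_set, min_cost
```

The call to `cost(medoids, data, dist)` is the Nk term of the complexity expression: each of the q repetitions scans the whole data set once per medoid, while the swap search only touches the s sampled objects.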

Since CLARA adopts a sampling approach, the quality of its clustering results
depends greatly on the size of the sample. When the sample size is small,
CLARA's efficiency in clustering large data sets comes at the cost of clustering
quality.

2.2 Clusterize 

This algorithm operates on an m-dimensional space to construct a minimal spanning
tree based on the 'weights', i.e. the distances computed for each pair of
m-dimensional points. Next, subject to a defined 'distance threshold' $\epsilon$, it deletes
those edges from the tree whose weights are equal to or greater than $\epsilon$. This
decomposes the original tree into k connected 'subtrees', where each subtree is
a cluster. To represent each of these clusters, Clusterize finds the medoid of
each cluster by utilizing the 'center' concept for each tree. The Clusterize
algorithm is presented next:

1 : Construct a minimal spanning tree $\tau$ for the input m-dimensional data points;

2 : Apply the 'threshold' $\epsilon$ on $\tau$ to decompose it into k sub-trees, i.e. $\tau_1, \tau_2, \ldots, \tau_k$
(where each $\tau_j$ corresponds to a cluster in the m-dimensional space);

3 : Determine the 'center' $c_j$ ($j = 1, \ldots, k$) for each $\tau_j$, which will be treated as
a 'medoid' for the corresponding cluster;

The time complexity of the algorithm for finding the spanning tree over 2-D
space is $O(e \log n)$, where e is the number of edges in the spanning tree and n
is the number of vertices. As the algorithm operates over an m-dimensional
space for finding k medoids, the actual complexity will be $O(e \log n \times m)$.
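
The three steps of Clusterize can be sketched in Python as below. This is a minimal sketch under assumptions, not the original implementation: the spanning tree is built with Kruskal's algorithm on the complete graph (so this version is quadratic in the number of points), and the medoid of each subtree is taken as the member with the smallest total distance to the other members, which is an assumed stand-in for the tree 'center'. The function names are hypothetical.

```python
import math
from itertools import combinations


def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def minimal_spanning_tree(points):
    """Kruskal's algorithm on the complete graph; returns (weight, i, j) edges."""
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    parent = list(range(len(points)))
    tree = []
    for w, i, j in edges:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:
            parent[ri] = rj
            tree.append((w, i, j))
    return tree


def clusterize(points, threshold):
    """Delete tree edges with weight >= threshold; return (clusters, medoids)
    as lists of point indices."""
    tree = minimal_spanning_tree(points)
    parent = list(range(len(points)))
    for w, i, j in tree:
        if w < threshold:  # only edges below the threshold survive
            parent[_find(parent, i)] = _find(parent, j)
    groups = {}
    for idx in range(len(points)):
        groups.setdefault(_find(parent, idx), []).append(idx)
    clusters = list(groups.values())
    # Assumption: medoid = member with the smallest total distance to the
    # rest of its cluster (stand-in for the tree 'center').
    medoids = [min(c, key=lambda a: sum(math.dist(points[a], points[b]) for b in c))
               for c in clusters]
    return clusters, medoids
```

Note that the number of clusters is not an input here: it follows from the threshold, since every deleted edge splits one subtree into two.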

3 The Proposed Distributed Algorithm 

The proposed algorithm adopts the following symbols and notations:

$p, k$ → number of machines and medoids, respectively;
$d(u, v)$ → Euclidean distance between u and v;
$N$ → total number of points in the database;
$m_j^i$ → j-th medoid in the i-th machine;
$M^i$ → set of medoids in the i-th machine;
$C_j^i$ → j-th cluster in the i-th machine;
$n(x)$ → number of data points represented by a medoid x, with $n(m_j^i) = |C_j^i|$;
$R_i$ → number of iterations in the i-th machine, with the maximum number of
iterations $R_{\max} = \max_{1 \le i \le p} R_i$.

The proposed algorithm operates in six major steps. In step 1, each machine
calls Clusterize to locally compute a set of k medoids, i.e. for the i-th
machine, the set of medoids is $M^i = \{m_1^i, m_2^i, \ldots, m_k^i\}$ and the set of
clusters is $C^i = \{C_1^i, C_2^i, \ldots, C_k^i\}$. In step 2, a central machine gathers the set
of all medoids $m_j^i$ (where $1 \le i \le p$ and $1 \le j \le k$). Then in step 3, the
central machine computes the k medoids by calling PAM on the $m_j^i$'s, and
calls them $X_1, X_2, \ldots, X_k$, where each $X_j$ represents a cluster $Y_j$. Here,
$X = \{X_1, X_2, \ldots, X_k\} \subset M$ and $\bigcup_j Y_j = M$. Afterwards, it computes the 'weighted
means' of the elements in $Y_i$ as

$$D_i = \frac{1}{\max_{v \in Y_i} n(v)} \sum_{v \in Y_i} v \, n(v)$$

and communicates $D_1, D_2, \ldots, D_k$ to each of the participating p machines. In
step 4, the i-th machine computes a set of k data points $u_1^i, u_2^i, \ldots, u_k^i$ using the $D_j$'s,
where each $u_j^i$ is a point in the i-th machine closest to $D_j$. Step 5 involves
gathering the computed (k × p) points in the central machine and finally, in
step 6, it computes the actual set of k medoids, i.e. $v_1, v_2, \ldots, v_k$, where each
$v_j \in \{u_j^1, u_j^2, \ldots, u_j^p\}$ and $d(v_j, D_j) \le d(u_j^i, D_j)$ for all i. The steps of the
algorithm are presented next.

1. The i-th machine calls Clusterize to compute k medoids $M^i = \{m_1^i, m_2^i, \ldots, m_k^i\}$
and the corresponding set of k clusters $C^i = \{C_1^i, C_2^i, \ldots, C_k^i\}$.

2. The central machine gathers p sets of medoids (each of cardinality k),
$M = \{m_j^i$, where $1 \le j \le k$ and $1 \le i \le p\}$.

3. The central machine
   3.1. computes k medoids $\{X_1, X_2, \ldots, X_k\} \subset M$ and the set of corresponding
        clusters $\{Y_1, Y_2, \ldots, Y_k\}$ using PAM, where the cluster $Y_j$ is
        represented by $X_j$ and $\bigcup_j Y_j = M$;
   3.2. computes k weighted means $D_1, D_2, \ldots, D_k$, where
        $D_j = \frac{1}{\max_{v \in Y_j} n(v)} \sum_{v \in Y_j} v \, n(v)$;
   3.3. communicates $D_1, D_2, \ldots, D_k$ to all the p machines.

4. The i-th machine computes k data points $u_1^i, u_2^i, \ldots, u_k^i$, where each $u_j^i$ is a point
in the i-th machine closest to $D_j$.

5. The central machine gathers the p sets of computed data points $V = \{u_j^i$, where
$1 \le i \le p$ and $1 \le j \le k\}$.

6. The central machine computes the final set of k medoids $v_1, v_2, \ldots, v_k$, where
$v_j \in \{u_j^1, u_j^2, \ldots, u_j^p\}$ and $d(v_j, D_j) \le d(u_j^i, D_j)$ for all i.
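
The following single-process Python sketch illustrates steps 3.2 and 4 to 6, with each machine simulated as a list of points. It is only a sketch under assumptions: `weighted_mean` normalises by the total weight, which is the conventional weighted mean and differs from the printed formula, which divides by the largest weight. All function names are hypothetical.

```python
import math


def weighted_mean(medoids, weights):
    """Weighted mean of medoid vectors, normalised by the total weight
    (assumption: the printed formula divides by the largest weight instead)."""
    total = sum(weights)
    dim = len(medoids[0])
    return tuple(sum(w * m[d] for m, w in zip(medoids, weights)) / total
                 for d in range(dim))


def closest_local_point(local_points, target):
    """Step 4: the point held by one machine that is closest to a target D_j."""
    return min(local_points, key=lambda u: math.dist(u, target))


def final_medoids(machines, targets):
    """Steps 4 to 6: every machine proposes its closest point to each D_j and
    the central machine keeps, for each j, the proposal nearest to D_j."""
    result = []
    for target in targets:
        # Steps 4 and 5: one proposal per machine, gathered centrally.
        proposals = [closest_local_point(pts, target) for pts in machines]
        # Step 6: the proposal with the smallest distance to D_j wins.
        result.append(min(proposals, key=lambda u: math.dist(u, target)))
    return result
```

Because each final medoid is chosen from actual data points held by the machines, the result is a set of medoids rather than a set of means, even though the targets themselves need not be data points.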



3.1 Complexity Analysis 

The step-wise complexity analysis of Disk-kjmedoids is as follows:

Step 1 computes the local medoids, approximately in time [...]. Next, for
gathering the data, the time requirement of step 2 is $O(pk)$. Then, the central
machine computes the k medoids in step 3.1 in time $O(p^2 k^2)$. In step 3.2, for
the computation of the k weighted means, the central machine requires $O(pk)$ and, for
communication, as in step 2, step 3.3 requires $O(pk)$ time.
Step 4, for the computation on each i-th machine, requires $O([\ldots] \cdot k)$ time and,
similarly, for gathering the (p × k) data points, step 5's time requirement is
$O(pk)$. Finally, for the computation of the optimum k medoids, step 6 requires
$O(pk)$ time.

However, the complexities due to steps 2, 3.2, 3.3, 5 and 6 are dominated by
the complexities of steps 1, 3.1 and 4. Among the dominating complexities,
step 1 contributes the most, and it is linear in order.

4 Experimental Results 

To justify the efficiency of the algorithm in comparison to CLARA, experiments
were conducted using 12-dimensional tuple databases of various sizes (where
each tuple represents a document image, characterized by 12 features) on an
HP Visualize Workstation (Linux-based) with an Intel CPU clocked at 450 MHz and
128 MB RAM. To obtain unbiased estimates, each experiment was carried
out 15 times and an overall performance estimate was calculated by averaging
the results of the 15 individual runs. Table 1 shows the average experimental
results.

As can be seen from Table 1, in terms of execution time the proposed algorithm
outperforms CLARA outright. However, in terms of clustering quality,
CLARA slightly outperforms it in the initial runs, i.e. when given only a small data
size (fewer than 4000). Later, as the data size increases, the proposed
algorithm is found to be as good as CLARA.

5 Conclusions 

A new, scalable, distributed k-medoids algorithm, capable of dealing with large
data sets, has been presented in this paper. The algorithm outperforms CLARA
both in cluster quality and in execution time. A set of experiments based on
12-dimensional feature data, representing a document image database, was
conducted to compare the performance of the proposed algorithm with its
counterparts, and the results have been presented. There is further scope to improve
the performance of the algorithm.
Table 1: Comparison Results of the Proposed Algorithm with CLARA
(Assuming p = h machines)

Database   No. of       Average Dissimilarity    Execution Time
Size       Medoids k    CLARA      proposed      CLARA    proposed
1000       10           0.01693    0.01932         270       73
2000       10           0.01800    0.01820         540      120
3000       10           0.02120    0.01975         700      150
4000       10           0.02170    0.02085         810      180
5000       10           0.01960    0.02090         900      200
6000       10           0.02035    0.01870        1000      210
7000       10           0.02080    0.01778        1050      220
8000       10           0.01930    0.01740        1080      240
Abstract: A new nonhierarchical clustering procedure for symbolic objects is presented
wherein during the first stage of the algorithm, the initial seed points are selected using the
concept of farthest neighbours, and in succeeding stages the seed points are computed
iteratively until the seed points get stabilised.

Keywords: Clustering, Symbolic objects, Symbolic similarity, Symbolic dissimilarity,
Symbolic mean.

1. Introduction:

The main objective of cluster analysis is to group a set of objects into clusters such 
that objects within the same cluster have a high degree of similarity, while objects 
belonging to different clusters have a high degree of dissimilarity. The clustering of 
a data set into subsets can be divided into hierarchical and nonhierarchical methods. 
The general rationale of a nonhierarchical method is to choose some initial partition 
of the data set and then alter cluster memberships so as to obtain a better partition 
according to some objective function. On the other hand, in hierarchical clustering
methods, the sequence of forming groups proceeds such that whenever two samples
belong (or do not belong) to the same cluster at some level, they remain together
(or separated) at all higher levels. Hierarchical clustering procedures can be divided
into agglomerative methods, which progressively merge the elements, and divisive
methods, which progressively subdivide the data set. A good survey of cluster
analysis can be found in the literature [1-6].

Symbolic objects are extensions of classical data types. In conventional data sets, 
the objects are "individualised" whereas in symbolic data sets, they are more 
"unified" by means of relationships. Symbolic objects are more complex than 
conventional data in the following ways [7]:

1. All objects of a symbolic data set may not be defined on the same variables.

2. Each variable may take one value or an interval of values. 

3. In complex symbolic objects, the values that the variables take may include 
one or more elementary objects. 

4. The description of a symbolic object may depend on the relations existing 
between other objects. 

5. The values that the variables take may include typicality values that indicate 
frequency of occurrence, relative likelihood, level of importance of the values 
and so on. 

K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 35-41, 2000. 
An informal description of various types of symbolic objects can be found in [7].
Ichino [8,9] defines general distance functions for mixed feature variables and also
defines generalised Minkowski metrics based on a new mathematical model called
the cartesian space model. Gowda and Diday [7,10] have proposed new
similarity and dissimilarity measures and used them for agglomerative clustering of
symbolic objects. They form composite symbolic objects using a cartesian join
operator whenever mutual pairs of symbolic objects are selected for agglomeration
based on minimum dissimilarity [7] or maximum similarity [10]. The combined
usage of similarity and dissimilarity measures for agglomerative [11] and divisive
clustering [12] of symbolic objects has been presented by Gowda and Ravi [11,12].
A survey of different techniques for handling symbolic data can be found in
[13-18]. Most of the algorithms available in the literature for clustering symbolic
objects are based on either conventional or conceptual hierarchical techniques
using agglomerative or divisive methods as the core of the algorithm. In this paper,
we propose a new nonhierarchical clustering scheme for symbolic objects.

The organisation of the paper is as follows: Section 2 discusses the notion of 
similarity and dissimilarity measures for symbolic objects, along with the concepts 
of farthest neighbour and composite symbolic object formation. The proposed 
nonhierarchical clustering scheme is presented in section 3 and section 4 discusses 
the experimental results. Section 5 concludes the paper. 

2. Concepts and definitions: 

a. Similarity and dissimilarity between symbolic objects:

Many distance measures have been introduced in the literature for symbolic
objects [7,8,9,10,11,12]. Here, we follow the similarity and dissimilarity measures
introduced by Gowda and Ravi [12], along with a brief explanation of these
measures. The similarity and dissimilarity between two symbolic objects A and B
are written as $S(A,B) = S(A_1,B_1) + \ldots + S(A_k,B_k)$ and $D(A,B) = D(A_1,B_1) + \ldots +
D(A_k,B_k)$. For the k-th feature, $S(A_k,B_k)$ and $D(A_k,B_k)$ are defined using the
following components, namely $S_p(A_k,B_k)$ and $D_p(A_k,B_k)$ due to position, $S_s(A_k,B_k)$
and $D_s(A_k,B_k)$ due to span, and $S_c(A_k,B_k)$ and $D_c(A_k,B_k)$ due to content. The
advantages of the proposed similarity and dissimilarity measures are discussed in
[11][12].

Quantitative interval type of $A_k$ and $B_k$:

Let al, au and bl, bu represent the lower and upper limits of intervals $A_k$ and $B_k$,
inters = length of intersection of $A_k$ and $B_k$, and ls = span length of $A_k$ and $B_k$ =
|max(au,bu) - min(al,bl)|, where max() and min() represent maximum and
minimum values respectively. The similarity and dissimilarity between two
samples $A_k$ and $B_k$ are defined on position and span. Similarity due to position is
defined as $S_p(A_k,B_k) = \sin[(1 - ((al - bl)/U_k)) \times 90]$ and similarity due to span
is defined as $S_s(A_k,B_k) = \sin[((la + lb)/(2 \times ls)) \times 90]$, where $U_k$ denotes the
length of the maximum interval of the k-th feature, la = |au - al| and lb = |bu - bl|.
Net similarity between $A_k$ and $B_k$ is $S(A_k,B_k) = S_p(A_k,B_k) + S_s(A_k,B_k)$.
Dissimilarity due to position is defined as $D_p(A_k,B_k) = \cos[(1 - ((al - bl)/U_k)) \times 90]$
and dissimilarity due to span is defined as $D_s(A_k,B_k) = \cos[((la + lb)/(2 \times ls)) \times 90]$.
Net dissimilarity between $A_k$ and $B_k$ is $D(A_k,B_k) = D_p(A_k,B_k) + D_s(A_k,B_k)$.
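
As a worked illustration of the interval formulas, the Python sketch below computes the net similarity and dissimilarity of two intervals. It makes two assumptions that are not stated in the text: the arguments of sin and cos are taken to be in degrees, and the position term uses the absolute difference |al - bl| so that the measure does not depend on the order of the two intervals. The function name `interval_similarity` is hypothetical.

```python
import math


def interval_similarity(a, b, u_k):
    """Return (S, D) for two intervals a = (al, au) and b = (bl, bu).

    u_k is the length of the maximum interval of the feature. Angles are
    assumed to be in degrees, and |al - bl| is assumed in the position term.
    The span length must be non-zero.
    """
    al, au = a
    bl, bu = b
    la, lb = abs(au - al), abs(bu - bl)
    ls = abs(max(au, bu) - min(al, bl))           # span length
    position = (1 - abs(al - bl) / u_k) * 90      # position angle in degrees
    span = (la + lb) / (2 * ls) * 90              # span angle in degrees
    s = math.sin(math.radians(position)) + math.sin(math.radians(span))
    d = math.cos(math.radians(position)) + math.cos(math.radians(span))
    return s, d
```

For two identical intervals both angles equal 90 degrees, so the net similarity reaches its maximum of 2 and the net dissimilarity falls to 0; as the intervals move apart both angles shrink and the cosine terms take over.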
Qualitative type of $A_k$ and $B_k$:

For qualitative type of features, the similarity and dissimilarity components due to
position are absent. The two components that contribute to similarity and
dissimilarity are span and content. Let la and lb represent the number of elements in
$A_k$ and $B_k$, inters = number of elements common to $A_k$ and $B_k$, and ls = span length
of $A_k$ and $B_k$ combined = la + lb - inters. The similarity component due to span is
defined as $S_s(A_k,B_k) = \sin[((la + lb)/(2 \times ls)) \times 90]$ and the similarity component
due to content is defined as $S_c(A_k,B_k) = \sin[(inters/ls) \times 90]$. Net similarity
between $A_k$ and $B_k$ is $S(A_k,B_k) = S_s(A_k,B_k) + S_c(A_k,B_k)$. The dissimilarity due to
span is defined as $D_s(A_k,B_k) = \cos[((la + lb)/(2 \times ls)) \times 90]$, and the dissimilarity
due to content is defined as $D_c(A_k,B_k) = \cos[(inters/ls) \times 90]$. Net dissimilarity
between $A_k$ and $B_k$ is $D(A_k,B_k) = D_s(A_k,B_k) + D_c(A_k,B_k)$.

b. Composite symbolic object: 

Merging is the process of gathering together two samples on the basis of a distance
measure and assigning them the same cluster membership, or label, for further
clustering. If the two samples that are merged are to be represented by a single
sample, one of the frequently used methods is to use the mean of the two as a single
representative. In symbolic data analysis, the concept of composite symbolic object
is made use of. The method of forming composite symbolic objects when two
symbolic objects $A_k$ and $B_k$ are merged is illustrated below:

Case I: When the k-th feature is of quantitative interval type: Let n = number of
samples, m = mean of the n samples, a = lowest value considering n samples,
b = highest value considering n samples, $n_1$ = number of samples between a and m,
$n_2$ = number of samples between b and m, $n = n_1 + n_2$, $a'm = am \times n_1/n$ and
$b'm = bm \times n_2/n$. Here, the length a'b' would represent the
composite symbolic object.

Case II: When the k-th feature is qualitative nominal: Here, the composite
symbolic object is defined as the union of $A_k$ and $B_k$.

c. Farthest neighbour concept: 

Gowda[19] has introduced the concept of "farthest neighbour" and successfully 
used it for classification of multispectral data. In a data set, as the nearest neighbour 
B can be found for a sample A using a suitable metric, so also the farthest 
neighbour can be found for sample A. In a sample set, the farthest neighbour of a 
sample A can be defined as the sample C in the set which is at the greatest distance 
from A. In the same way, the farthest neighbour of a set of samples S can be 
defined as the sample which is at the greatest distance from the set S. This, of
course, requires the definition of distance between a set of samples and another
sample outside the set. The distance d between a sample set S and a sample A 
outside the set is defined as d = a.b where a is the sum of distances from A to the 
samples in S and b is the distance between A and its nearest neighbour from set S. 
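
The set-to-sample distance d = a.b can be sketched in Python as follows. This is a minimal illustration: `dist` stands for whatever dissimilarity is in use, and the function names are hypothetical. How the very first seed is chosen is not covered here.

```python
def set_distance(sample, chosen, dist):
    """d = a * b, where a is the sum of distances from the sample to every
    member of the set and b is the distance to its nearest member."""
    distances = [dist(sample, s) for s in chosen]
    return sum(distances) * min(distances)


def farthest_neighbour(chosen, candidates, dist):
    """The candidate outside the set that is at the greatest distance from it."""
    return max(candidates, key=lambda x: set_distance(x, chosen, dist))
```

Multiplying by the nearest-member distance penalises candidates that are far from the set on average but close to one of its members, which tends to spread the selected samples apart.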

3. Algorithm:

The nonhierarchical symbolic clustering algorithm proceeds as follows: 

1. Let $\{X_1, X_2, \ldots, X_N\}$ be a set of N symbolic objects on k features. Let the initial
number of clusters be N with each cluster having a cluster weight of unity.

2. Specify the number of clusters required as C.

3. Compute the dissimilarities between all pairs of symbolic objects in the data
set as $D(X_i,X_j) = D_1(X_i,X_j) + \ldots + D_k(X_i,X_j)$, where $D_1(X_i,X_j)$, $D_2(X_i,X_j)$,
..., $D_k(X_i,X_j)$ are determined according to the type of the features.
4. Compute C farthest neighbours of the symbolic data set by making use of the
dissimilarity values.

5. Choose the C farthest neighbours selected as the representative samples of C 
classes. 

6. Make i=0; 

7. Consider sample Xi of the symbolic data set. Compute the similarity between 
Xi and the C representative samples. Assign sample Xi to the class having 
highest similarity. 

8. Make i = i + 1; 

9. If i ==N go to 10 or else go to 7. 

10. Recompute the feature values of the C representative samples as the symbolic 
mean of the samples belonging to each class. 

11. If any symbolic mean has changed its value go to 7 or else stop. 

12. Merge all the samples of each class to form a composite symbolic object which 
would give the description of each class. 

Symbolic Mean:

1. For quantitative type of data, the symbolic mean is computed as
$al_m = \sum_{i=1}^{n} al_i / n$ and $au_m = \sum_{i=1}^{n} au_i / n$, where al = lower limit of
interval, au = upper limit of interval, n = total number of samples in a class.

2. The symbolic mean of qualitative data is computed by taking into consideration
the number of times an attribute gets repeated.
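
For interval data, the symbolic mean amounts to averaging the lower limits and the upper limits separately. A minimal Python sketch follows; the function name `interval_symbolic_mean` is hypothetical, and the qualitative case is left out because its exact rule is not spelled out here.

```python
def interval_symbolic_mean(intervals):
    """Symbolic mean of (lower, upper) intervals: the mean of the lower
    limits paired with the mean of the upper limits."""
    n = len(intervals)
    lower = sum(al for al, _ in intervals) / n
    upper = sum(au for _, au in intervals) / n
    return lower, upper
```

In step 10 of the algorithm, a function of this kind would be applied per interval feature to the samples currently assigned to each class.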

4. Experimental Results:

In this section, the performance of the proposed algorithm is tested and evaluated
using some test data reported in the literature. The data sets used in these
experiments are synthetic or real data and their classification is known from other
clustering techniques [7,10,11,12]. In order to compare the results of the proposed
algorithm, the conventional nonhierarchical algorithm was applied on the data sets,
by randomly selecting the initial seed points of the classes. The clusters obtained
using the proposed method were examined for their validity using Hubert's Γ
statistics approach [4] and the level of significance values obtained were recorded.
The simulation experiments are explained below:

Experiment 1: The first experiment is such that the input data is of numeric type
and the output is symbolic. The objects of numeric type were drawn from a mixture
of normal distributions with known number of classes and classification so that the
results show the efficacy of the algorithm for clustering the objects and finding the
number of classes. The test set is drawn from a mixture of C normal distributions
with mean $m_i$ and covariance matrix $C_i$ having individual variances of 0.15 and zero
covariances. The different values of the number of classes were 2, 3, 4, 5, 6, 7, 8 and
the means chosen were (1,3), (1,3,5), (1,3,5,7), (1,3,5,7,9), (1,3,5,7,9,11),
(1,3,5,7,9,11,13) and (1,3,5,7,9,11,13,15). These test samples were independently
generated using a Gaussian vector generator. The proposed algorithm was used on
this test data set. There was a perfect agreement between the number of classes
used for generating Gaussian clusters and the number of classes obtained by the
proposed algorithm. In all seven cases, the classification results were in full
agreement with the test samples generated and the classes used.
Experiment 2: The data set for this example is chosen so as to demonstrate the
efficacy of the proposed algorithm in clustering data belonging to two classes with
lots of overlaps. The data set used is the well known iris data set [4]. The proposed
algorithm was applied on the two class iris data (iris setosa and iris versicolor)
having 100 samples. The algorithm resulted in two classes which were in perfect
agreement with the data set considered. For the two classes of the iris data set, the
proposed method took two iterations and resulted in a level of significance of 1.00.
On the other hand, the conventional algorithm took three iterations and resulted in a
level of significance of 1.00. It can be observed that the proposed algorithm takes
fewer iterations than the conventional algorithm.

Experiment 3: The data set [8] used for this problem consists of data of fats and oils
having four quantitative features of interval type and one qualitative feature. The
proposed algorithm was applied on this data set specifying two classes. The
samples of the two classes obtained were as follows: {0,1,2,3,4,5} and {6,7}. For
the fat oil data set, the proposed algorithm took two iterations and resulted in a
level of significance of 0.98. On the other hand, the conventional algorithm took
three iterations and resulted in a level of significance of 0.78. It can be observed
that the proposed method gives a higher level of significance value and also takes
fewer iterations than the conventional algorithm.

Experiment 4: The data set of microcomputers [8] is considered for this experiment.
The proposed algorithm was tested on this data set by specifying the number of
classes as two. The proposed algorithm took two iterations and resulted in a level
of significance of 0.90. On the other hand, the conventional algorithm took three
iterations and resulted in a level of significance of 0.82. It can be observed that the
proposed method gives a higher level of significance value and takes fewer
iterations than the conventional algorithm.

Experiment 5: The data set of microprocessors [9] is considered for this 
experiment. The proposed algorithm was tested on this data set by specifying the 
number of classes as three. The samples of the three classes obtained were 
{0,1,4,5}, {3,7,8} and {2,6}. For the microprocessor data set, the proposed 
algorithm took three iterations and resulted in a level of significance of 0.91. The 
conventional algorithm took four iterations and resulted in a level of significance 
of 0.84. The proposed method thus gives a higher level of significance and takes 
fewer iterations than the conventional algorithm. 

Experiment 6: The data set for this experiment is taken from botany [8]. It 
consists of 9 trees belonging to 3 classes. The proposed algorithm for the 
three-class case resulted in the following: {0,1,2}, {3,4,5} and {6,7,8}. For the 
botanical data, the proposed algorithm took two iterations and resulted in a level 
of significance of 1.00. The conventional algorithm took three iterations and also 
resulted in a level of significance of 1.00. The proposed method thus takes fewer 
iterations than the conventional algorithm. 

From the experimental results, it can be seen that the proposed algorithm shows an 
improvement over the conventional algorithm both in the quality of the clustering 
obtained and in the number of iterations required. 

5. Conclusion 

A nonhierarchical clustering procedure for symbolic objects is presented. In the 
first stage, the initial seed points are selected using the concept of farthest 
neighbours. During succeeding stages, the seed points are recomputed iteratively 
until they stabilise. The proposed nonhierarchical clustering procedure works on 
symbolic data of mixed feature types consisting of quantitative (ratio, absolute, 
interval) and qualitative (nominal, ordinal, combinational) values. Several 
artificial and real-life data sets with a known number of classes and known 
classification assignments are used to establish the efficacy of the proposed 
algorithm, and the results are presented. 

References 

1. E. Diday and J.C. Simon, Clustering Analysis: Communication and Cybernetics, 
Vol 10, New York, Springer Verlag, 1976, pp. 47-92. 

2. E. Diday, C. Hayashi, M. Jambu and N. Ohsumi, Eds, Recent developments in 
clustering and data analysis, New York: Academic, 1987. 

3. H.H. Bock, Ed, Classification and related methods of data analysis, Amsterdam: 
North Holland, 1987. 

4. A.K. Jain and R.C. Dubes, Algorithms for clustering data, Englewood Cliffs, NJ: 
Prentice Hall, 1988. 

5. E. Diday, Ed., Data analysis, learning symbolic and numeric knowledge, Antibes, 
France: Nova Science Publishers, 1989. 

6. R.O. Duda and P.E. Hart, Pattern classification and scene analysis, New York: 
Wiley Interscience, 1973. 

7. K.C. Gowda and E. Diday, "Symbolic clustering using a new dissimilarity measure", 
Pattern Recognition, Vol 24, No. 6, pp. 567-578, 1991. 

8. M. Ichino, "General metrics for mixed features - The cartesian space theory for 
pattern recognition", in Proc. IEEE Conf. Systems, Man and Cybernetics, Atlanta, GA, 
pp. 14-17, 1988. 

9. M. Ichino and H. Yaguchi, "General Minkowsky metric for mixed feature type", IEEE 
Transactions on Systems, Man and Cybernetics, Vol 24, pp. 698-708, 1994. 

10. K.C. Gowda and E. Diday, "Symbolic clustering using a new similarity measure", 
IEEE Transactions on Systems, Man and Cybernetics, Vol 22, No. 2, pp. 368-378, 1992. 

11. K.C. Gowda and T.V. Ravi, "Agglomerative clustering of symbolic objects using the 
concepts of both similarity and dissimilarity", Pattern Recognition Letters 16 (1995), 
pp. 647-652. 

12. K.C. Gowda and T.V. Ravi, "Divisive clustering of symbolic objects using the 
concepts of both similarity and dissimilarity", Pattern Recognition, Vol 28, No. 8, 
pp. 1277-1282, 1995. 

13. E. Diday, The symbolic approach in clustering, classification and related methods 
of data analysis, H.H. Bock, Ed. Amsterdam, The Netherlands: Elsevier, 1988. 

14. D.H. Fisher and P. Langley, "Approaches to conceptual clustering", in Proc. 9th 
International Joint Conference on Artificial Intelligence, Los Angeles, CA, 1985, 
pp. 691-697. 

15. R. Michalski, R.E. Stepp and E. Diday, "A recent advance in data analysis: 
clustering objects into classes characterized by conjunctive concepts", Progress in 
Pattern Recognition, Vol 1, L. Kanal and A. Rosenfeld, eds (1981). 

16. R. Michalski and R.E. Stepp, "Automated construction of classifications: 
Conceptual clustering versus numerical taxonomy", IEEE Transactions on Pattern 
Analysis and Machine Intelligence, PAMI-5, pp. 396-410, 1983. 

17. Y. Cheng and K.S. Fu, "Conceptual clustering in knowledge organisation", IEEE 
Transactions on Pattern Analysis and Machine Intelligence, PAMI-7, pp. 592-598, 1985. 




18. D.H. Fisher, "Knowledge acquisition via incremental conceptual clustering", 
Machine Learning, No. 2, pp. 103-138, 1987. 

19. K.C. Gowda, "A feature reduction and unsupervised classification algorithm for 
multispectral data", Pattern Recognition, Vol 17, No. 6, pp. 667-676, 1984. 




Quantization of Continuous Input Variables 
for Binary Classification 



Michal Skubacz^1 and Jaakko Hollmen^2 

^1 Siemens Corporate Technology, Information and Communications, Neural 
Computation, 81730 Munich, Germany, Michal.Skubacz@mchp.siemens.de 
^2 Helsinki University of Technology, Laboratory of Computer and Information 
Science, P.O. Box 5400, 02015 HUT, Finland, Jaakko.Hollmen@hut.fi 



Abstract. Quantization of continuous variables is important in data 
analysis, especially for some model classes such as Bayesian networks 
and decision trees, which use discrete variables. Often, the discretization 
is based on the distribution of the input variables only, whereas 
additional information, for example in the form of class membership, is 
frequently present and could be used to improve the quality of the results. 
In this paper, quantization methods based on equal width interval, 
maximum entropy, maximum mutual information and a novel approach 
based on maximum mutual information combined with entropy are 
considered. The two former approaches do not take the class membership 
into account, whereas the two latter approaches do. The relative merits 
of each method are compared in an empirical setting, where results are 
shown for two data sets in a direct marketing problem, and the quality 
of quantization is measured by mutual information and the performance 
of Naive Bayes and C5 decision tree classifiers. 



1 Introduction 

Whereas measurements in many real-world problems are continuous, it may be 
desirable to represent the data as discrete variables. Discretization simplifies 
the data representation, improves interpretability of results, and makes data 
accessible to more data mining methods [6]. In decision trees, quantization as 
a pre-processing step is preferable to a local quantization process as part of the 
decision tree building algorithm [1,4]. In this paper, quantization of continuous 
variables is considered in a binary classification problem. Three standard 
quantization approaches are compared to a novel approach, which attempts to 
balance the quality of the input representation (measured by entropy) and the class 
separation (measured by mutual information). 

The comparison of the four approaches to quantization is performed on two 
data sets from a direct marketing problem. Mutual information, the Naive Bayes 
classifier, and the C5 decision tree [8] are used in measuring the quality of the 
quantizations. 

2 Quantization 

Quantization, also called discretization, is the process of converting a continuous 
variable into a discrete variable. The discretized variable has a finite number of 
values ($J$), the number usually being considerably smaller than the number of 
possible values in the empirical data set. In the binary classification problem, a 
data sample $(x_i, y_i)$, $i = 1, \ldots, N$, with $y_i \in \{0, 1\}$ is available. 
Variable $x_i$ is a vector of variables on a continuous scale. In the quantization 
process, component $k$ of $x_i$, later denoted by $x_{ik}$, is mapped to its 
discrete counterpart $x'_{ik}$ when the original variable $x_{ik}$ belongs to the 
interval defined by the lower and upper bounds of the bin. The number of data 
falling into bin $j$ is defined as $n_{kj}$ and the probability of a bin as 
$p_{kj} = n_{kj} / N$. 

One could approach the discretization process in many different ways, starting 
for example from naive testing of random configurations and selecting the 
best one for a particular problem. More structured approaches may consider 
discretizing all variables at the same time (global), or each one separately 
(local). The methods may use all of the available data at every step in the process 
(global) or concentrate on a subset of data (local) according to the current level 
of discretization. Decision trees, for instance, are usually local in both senses. 
Furthermore, the following two search procedures could be employed. The top-down 
approach [4] starts with a small number of bins, which are iteratively split 
further. The bottom-up approach [5], on the other hand, starts with a large number 
of narrow bins, which are iteratively merged. In both cases, a particular split or 
merge operation is based on a defined performance criterion, which can be global 
(defined for all bins) or local (defined for two adjacent bins only). An example 
of a local criterion was presented in [5]. 

In this paper, a globally defined performance criterion is optimized using a 
greedy algorithm. In each iteration of the one-directional greedy algorithm, the 
most favorable action at the time is chosen. In the initial configuration one 
allocates a large number of bins to a variable and then repeatedly merges two 
adjacent bins, each time choosing the most favorable merge operation. The 
approaches used in this paper are local in the sense that variables are discretized 
separately and global in the sense that all the available data are used in every 
step of the quantization process. Discretizing variables separately assumes 
independence between them, an assumption which is usually violated in practice. 
However, this simplifies the algorithms and makes them scalable to large data sets 
with many variables. In contemporary data mining problems, these attributes become 
especially important. In real situations, one particular value on the continuous 
scale may occur very frequently, overwhelming the entire distribution of the 
variable. For example, the field "total length of the international telephone 
calls" for a particular private customer is likely to be predominantly filled with 
zeros. This situation corresponds to a peak in the probability density function and 
can lead to the deterioration of the quantization process. If this is detected, for 
example by checking whether a given value appears in more than 60% of the samples, 
a dedicated interval should be allocated and these samples removed from the 
discretization process. 
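
The dominant-value check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_dominant_value`, the sample data, and the use of a strict "more than 60%" comparison are assumptions made for the example.

```python
from collections import Counter


def split_dominant_value(values, threshold=0.6):
    """Separate a dominant value from the samples left for discretization.

    Returns (dominant, rest). `dominant` is the value that appears in more
    than `threshold` of the samples, or None if there is no such value.
    """
    if not values:
        return None, []
    value, count = Counter(values).most_common(1)[0]
    if count / len(values) > threshold:
        # The dominant value gets its own dedicated interval.
        return value, [v for v in values if v != value]
    return None, list(values)


# Hypothetical "total length of international calls" field, mostly zeros.
call_minutes = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 3.0, 47.0]
dominant, rest = split_dominant_value(call_minutes)
print(dominant, rest)
```

Only the samples in `rest` would then be passed on to the quantization method; the dominant value keeps its own bin.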

Equal Width Interval By far the simplest and most frequently applied method 
of discretization is to divide the range of data into a predetermined number of 
bins [6]. Each bin is by construction equally wide, but the probabilities of the 
bins may vary according to the data. In classification problems, this approach 
ignores the information about the class membership of data assigned to each bin. 
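
As a small sketch of equal width binning, the snippet below computes the inner bin boundaries and assigns each sample to a bin. The helper names `equal_width_edges` and `assign_bin`, and the convention that a value lying exactly on a boundary goes to the upper bin, are assumptions of this example.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Inner bin boundaries that split [min, max] into n_bins equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def assign_bin(x, edges):
    """Index of the bin holding x; a value on a boundary goes to the upper bin."""
    return bisect_right(edges, x)


values = [0.0, 1.0, 2.0, 3.0, 10.0]
edges = equal_width_edges(values, 4)
bins = [assign_bin(v, edges) for v in values]
print(edges, bins)
```

With one outlying value, three of the five samples land in the first bin and one bin stays empty, which illustrates how the bin probabilities can vary.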

Maximum Entropy An alternative method is to create bins so that each bin 
contributes equally to the representation of the input data. In other words, the 
probability of each bin for the data should be approximately equal. In fact, this 
is achieved by maximizing the entropy of the binned data. The entropy for the 
binned variables may be defined as $H_k = -\sum_{j} p_{kj} \log p_{kj}$, where the 
sum is over all bins. Entropy has been used in the context of discretizing 
variables in [9]. 
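
The following sketch illustrates why equally populated bins maximize the entropy. It assumes base-2 logarithms (the text does not fix the base) and approximates maximum-entropy binning by placing boundaries at sample quantiles; the function names are hypothetical, and heavily tied values could produce duplicate boundaries that this sketch does not handle.

```python
import math
from bisect import bisect_right
from collections import Counter


def entropy(counts):
    """Entropy (in bits) of the bin probabilities implied by the bin counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def equal_frequency_edges(values, n_bins):
    """Inner boundaries chosen at sample quantiles (roughly equal bin counts)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def bin_counts(values, edges):
    """Number of samples per bin; a boundary value falls into the upper bin."""
    counts = Counter(bisect_right(edges, v) for v in values)
    return [counts.get(j, 0) for j in range(len(edges) + 1)]


values = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 50.0]
edges = equal_frequency_edges(values, 4)
counts = bin_counts(values, edges)
print(edges, counts, entropy(counts))
```

Four equally filled bins reach the maximum of two bits, whereas the equal width bins for the same data (counts 7, 0, 0, 1) have a much lower entropy.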



Maximum Mutual Information In classification problems, it is important to 
optimize the quantized representation with regard to the distribution of the output 
variable. In order to measure information about the output preserved in the 
discretized variable, mutual information may be employed [3]. Mutual information 
was used in the discretization process of the decision tree construction algorithm 
(ID3) in [7]. Mutual information is measured in terms of quantized variables as 

$$I_k = \sum_{j} \sum_{y \in \{0,1\}} P(j, y) \log \frac{P(j, y)}{p_{kj}\, P(y)},$$ 

where $P(j, y)$ is the joint probability of bin $j$ of variable $k$ and class $y$, 
and $P(y)$ is the probability of class $y$. 

Maximum Mutual Information with Entropy By combining the maximum entropy 
and the mutual information approaches, one hopes to obtain a solution 
with the merits of both. This should strike a balance between the representation 
of the input and the knowledge of the output variable at the same time. In other 
words, one would like to retain balanced bins that turn out to be more reliable 
(preventing overfitting in this context) but simultaneously optimize the binning 
for classification. Our greedy algorithm is based on a criterion function which is 
the product of the entropy and the mutual information: 

$$H_k \, I_k.$$ 

The greedy algorithm approximates gradient ascent optimization. Writing 
the gradient of the product of two functions as 
$\frac{\partial}{\partial \theta}\left[f(x;\theta)\,g(x;\theta)\right] = 
f'(x;\theta)\,g(x;\theta) + g'(x;\theta)\,f(x;\theta)$, we note that the search 
direction is driven by the balance of the two factors, subject to constraints 
imposed by the data. A similar measure involving mutual information divided by 
entropy was proposed in the context of discretization in [8]. However, that measure 
was used for the problem of binary discretization in a splitting operation. Our 
novel approach assumes discretization into several bins, and the comparison is done 
among all merging operations. 
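
A minimal sketch of the bottom-up greedy merging with the product criterion is given below. It is an illustration under stated assumptions rather than the authors' code: each bin is represented only by its pair of class counts, logarithms are taken to base 2, and the names `entropy`, `mutual_information`, `criterion` and `greedy_merge` are hypothetical.

```python
import math


def entropy(bins):
    """Entropy (bits) of the bin probabilities; bins are (n_class0, n_class1)."""
    total = sum(n0 + n1 for n0, n1 in bins)
    h = 0.0
    for n0, n1 in bins:
        p = (n0 + n1) / total
        if p > 0:
            h -= p * math.log2(p)
    return h


def mutual_information(bins):
    """Mutual information (bits) between the bin index and the binary class."""
    total = sum(n0 + n1 for n0, n1 in bins)
    class_totals = [sum(b[c] for b in bins) for c in (0, 1)]
    mi = 0.0
    for b in bins:
        p_bin = (b[0] + b[1]) / total
        for c in (0, 1):
            if b[c] > 0:
                p_joint = b[c] / total
                p_class = class_totals[c] / total
                mi += p_joint * math.log2(p_joint / (p_bin * p_class))
    return mi


def criterion(bins):
    """Product of entropy and mutual information."""
    return entropy(bins) * mutual_information(bins)


def greedy_merge(bins, target_bins, score=criterion):
    """Repeatedly merge the pair of adjacent bins that maximizes `score`."""
    bins = [tuple(b) for b in bins]
    while len(bins) > max(target_bins, 1):
        candidates = []
        for j in range(len(bins) - 1):
            merged = (bins[j][0] + bins[j + 1][0], bins[j][1] + bins[j + 1][1])
            candidates.append(bins[:j] + [merged] + bins[j + 2:])
        bins = max(candidates, key=score)
    return bins


# Hypothetical initial configuration: six narrow bins with class counts.
fine_bins = [(9, 1), (8, 2), (5, 5), (2, 8), (1, 9), (0, 10)]
print(greedy_merge(fine_bins, 3))
```

Passing `score=mutual_information` or `score=entropy` to `greedy_merge` gives greedy variants of the other two data-driven criteria, which makes the methods easy to compare on the same initial bins.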



3 Experiments 

Two data sets were used in the evaluation. Both of them were collected and 
used in direct marketing campaigns. The input variables represented customer 
information and the output was the customer's binary response. Data set 1 
consisted of 144 input variables and 12496 samples, whereas data set 2 had 75 
input variables and 35102 samples. The first data set was artificially balanced to 
contain an equal number of positive and negative responses; in the second data 
set only one tenth of the samples belonged to the positive response class, as is 
usual in strongly imbalanced direct marketing problems. The evaluation criteria 
used for measuring the influence of the discretization procedure on the 
classification problem were mutual information, predicted 50 % response rate based 
on Naive Bayes, and classification accuracy of the C5 classifier. Each experiment 
was conducted with a randomly selected training set and a testing set of the same 
size; the results shown are based on the testing set. All the experiments were 
repeated 25 times. In the case of mutual information, all the variables of each 
data set were discretized and the mutual information between each discretized 
variable and the output variable was measured on the test data. From each data set, 
the 10 most relevant variables were chosen and, in order to create different 
subproblems, randomly selected subsets of four variables were used for building 
classifiers. When the response rate is used together with Naive Bayes, the possibly 
imbalanced class priors present in the data do not have any effect. For the C5 
classifier, a fixed cost matrix was given to flatten out the imbalanced class 
distribution. All the experiments were repeated with the goal of discretizing the 
continuous variables to 4, 6, and 8 bins. The results are shown in terms of 
relative performance in Fig. 1. 

4 Discussion 

When the relative scores are measured by mutual information, the approaches that 
take into account the class membership of data prove to be superior. The ranking of 
the methods remains the same in both the balanced and the imbalanced data sets. 
In general, the addition of bins improves the performance of the discretization 
methods. Moreover, the mutual information approach is better than the novel 
method for a low number of bins, whereas the novel method is superior 
when the number of bins is larger, even though mutual information is used 
as the assessment measure. The importance of the entropy term in the novel 
method increases with the number of bins. Of the simple methods, which ignore 
the available output information in the classification problem, the entropy-based 
method is better than the equal width interval method. 

Using the 50 % response rate based on the Naive Bayes classifier, the results are 
somewhat more difficult to interpret. In this case it is important to note that each 
variable is treated separately, which is likely to increase the independence of the 
discretized variables compared with the original ones. It seems that the novel 
method is superior to all other methods, although the large variance of the 
estimates makes this open to debate. For example, in the case of data set 1 and 
the experiment with eight intervals, the median of the novel method is the best, 
the 75 % confidence interval is similar to others, and finally the 95 % confidence 
limits are much worse than in the case of mutual information. On the other hand, 
the median performance of the novel method proves to be the best in most cases. 

Figure 1. Relative scores of the discretization methods measured by mutual 
information are shown for data set 1 (first row, left panel) and data set 2 (first 
row, right panel). Relative scores of the discretization methods measured by 50 % 
response rate of Naive Bayes are shown for data set 1 (second row, left panel) and 
data set 2 (second row, right panel). Relative scores of the discretization methods 
measured by classification performance achieved with the C5 classifier are shown 
for data set 1 (third row, left panel) and data set 2 (third row, right panel). In 
all figures, the horizontal axis is divided into three sections for experiments 
with four, six and eight bins. The order of discretization methods in each section 
is equal width interval (1), maximum entropy (2), maximum mutual information (3), 
and maximum mutual information with entropy (4). The performance of repeated 
experiments is visualized with the median and the 25 % and 75 % percentiles. In 
addition, the 95 % confidence interval is shown with dashed lines. 

In the case of the C5 classifier, none of the methods outperforms the others, 
especially on the first data set. The variance of the estimates is also too large 
to make accurate judgments. In the second data set, however, the equal width 
interval approach is clearly worse than the other presented methods. One possible 
reason for the questionable performance of the tree classifier could be that our 
discretization works on each input variable separately, whereas optimal creation 
of the decision tree would take into account the interdependencies between 
variables. Using the novel discretization method, these interdependencies are 
essentially ignored and the solution is likely to weaken the interdependencies 
between discretized input variables. Taking all the variables into account at the 
same time may be beneficial in this context, as proposed in [2]. 

5 Summary 

Methods for quantizing continuous input variables in classification problems were 
presented. The relative merits of the equal width interval, maximum entropy, 
maximum mutual information and the novel maximum mutual information with entropy 
approaches were compared on two data sets from direct marketing problems 
using three criteria. In conclusion, none of the tested approaches is preferred 
over the others when the C5 decision tree is to be used for modeling. On 
the other hand, the novel method proposed in this paper is recommended 
for Naive Bayes classifiers, where it may lead to performance improvements. 

Abstract. Emerging patterns (EPs) are knowledge patterns capturing 
contrasts between data classes. In this paper, we propose an information-based 
approach for classification by aggregating emerging patterns. The 
constraint-based EP mining algorithm enables the system to learn from 
large-volume and high-dimensional data; the new approach for selecting 
representative EPs and the efficient algorithm for finding the EPs give 
the system high predictive accuracy and short classification time. 
Experiments on many benchmark datasets show that the resulting classifiers 
have good overall predictive accuracy, and are often also superior to 
other state-of-the-art classification systems such as C4.5, CBA and LB. 



1 Introduction 

Classification has long been studied within the machine learning community. 
Recently, it has also attracted attention from the data mining community, which 
has focused on important issues such as ways to deal with large volumes of training 
data and to achieve higher predictive accuracy [7, 3, 6]. In this paper, we address 
these issues by proposing a classifier, iCAEP, which performs information-based 
Classification by Aggregating Emerging Patterns. 

Emerging patterns (EPs) are multivariate knowledge patterns capturing 
differences between data classes [2]. For example, e1 and e2 are EPs of the Benign 
and Malignant classes of the Wisconsin-breast-cancer dataset^1, respectively: 
e1 = {(Bare-Nuclei,1), (Bland-Chromatin,3), (Normal-Nucleoli,1), (Mitoses,1)} and 
e2 = {(Clump-Thickness,10)}. Their supports in the whole dataset and in the two 
classes, and their growth rates (support ratios), are listed below.^2 



EP    support   Malignant support   Benign support   growth rate 
e1    13.45%    0.41%               20.31%           49.54 
e2    28.63%    28.63%              0%               ∞ 



In a classification task, EPs of a class can be seen as the distinguishing 
features of the class, whose power is indicated by their support and growth rate. 
In the above example, given a test instance T, if T only contains e2, we tend to 
assign T to the Malignant class. However, intricacy arises when T contains EPs 
of both the Benign and Malignant classes. In iCAEP, rather than relying on a 
single EP to make a decision, we select representative EPs from all the EPs that 
appear in T for a more reliable decision. Each EP is seen as a message indicating 
the class bias of T. By aggregating the contribution of EPs in an information-based 
approach, we reach a reliable classification decision, taking into consideration 
the predictive power of EPs and the unbalanced densities of EPs in each class. 

^1 http://www.ics.uci.edu/~mlearn/MLRepository.html 
^2 See § 2 for definitions. 

We also aim to tackle large-volume and high-dimensional classification tasks. 
For example, the UCI Connect-4 dataset, consisting of 67,557 instances 
defined by 42 attributes, is a great challenge to mining algorithms. In iCAEP, 
at the training stage, the constraint-based mining algorithm ConsEPMiner [12] 
is adapted to efficiently mine EPs from large high-dimensional datasets; at the 
classification stage, we propose a new approach to select representative EPs and 
an algorithm to efficiently find such EPs for classifying a test instance. 

Experiments show that iCAEP is often superior to other state-of-the-art 
classifiers in classification accuracy, that the constraint-based EP mining 
approach significantly improves training time, and that the new classification 
strategy significantly improves classification time. iCAEP can successfully learn 
from challenging real-world large-volume and high-dimensional training datasets, 
including Connect-4. 

2 Information-based Classification by Aggregating EPs 

We assume that datasets are relational tables, where each instance takes a value 
for each attribute. We preprocess the datasets as follows: continuous attribute 
domains are first discretized into disjoint intervals, and each attribute value is 
mapped to the interval containing it. An (attribute, interval) or (attribute, 
value) pair is mapped into an item. An instance $T$ defined by $n$ attributes 
is then mapped to an itemset of $n$ items: $\{a_1, a_2, \ldots, a_n\}$. The support 
of an itemset $X$ in a dataset $D$, $supp_D(X)$, is 
$\frac{|\{t \in D \mid X \subseteq t\}|}{|D|}$. Given a background dataset $D'$ and 
a target dataset $D''$, the growth rate of an itemset $X$ from $D'$ to $D''$ is 
$GR(X) = \frac{supp_{D''}(X)}{supp_{D'}(X)}$ ("define" $\frac{0}{0} = 0$ and 
$\frac{>0}{0} = \infty$); EPs from $D'$ to $D''$, or simply EPs of $D''$, are 
itemsets whose growth rate is greater than some given threshold $\rho$ 
($\rho > 1$). A training dataset $D$ of $m$ classes is partitioned into $D_1$, 
..., $D_m$, where $D_i$ consists of training instances with class label $C_i$. The 
EP set for class $C_i$, $E_i$, consists of EPs from $D - D_i$ to $D_i$. 
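
The definitions of support and growth rate can be illustrated with a short sketch. The toy data, the representation of instances as frozensets of (attribute, value) items, and the function names are assumptions of this example; it does not reproduce the ConsEPMiner mining algorithm.

```python
import math


def support(itemset, dataset):
    """Fraction of instances in `dataset` that contain every item of `itemset`."""
    if not dataset:
        return 0.0
    return sum(1 for t in dataset if itemset <= t) / len(dataset)


def growth_rate(itemset, background, target):
    """Support ratio from background to target, with 0/0 = 0 and >0/0 = inf."""
    s_bg = support(itemset, background)
    s_tg = support(itemset, target)
    if s_bg == 0:
        return 0.0 if s_tg == 0 else math.inf
    return s_tg / s_bg


def is_emerging_pattern(itemset, background, target, rho):
    """An EP of the target dataset has a growth rate above the threshold rho."""
    return growth_rate(itemset, background, target) > rho


# Hypothetical two-attribute data; each instance is a set of (attribute, value).
target = [frozenset({("a", 1), ("b", 1)}), frozenset({("a", 1), ("b", 2)}),
          frozenset({("a", 2), ("b", 1)}), frozenset({("a", 1), ("b", 1)})]
background = [frozenset({("a", 2), ("b", 2)}), frozenset({("a", 2), ("b", 1)}),
              frozenset({("a", 1), ("b", 2)}), frozenset({("a", 2), ("b", 2)})]

x = frozenset({("a", 1)})
y = frozenset({("a", 1), ("b", 1)})
print(growth_rate(x, background, target), growth_rate(y, background, target))
```

Here `y` never occurs in the background data, so its growth rate is infinite; such itemsets are the jumping EPs discussed in § 2.1.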

We employ the minimum encoding inference approach to classify a test instance, 
using EPs appearing in the instance as messages. According to the Minimum 
Message Length (MML) [10] or Minimum Description Length (MDL) [9] 
principle, to assess the quality of a model $m_i$ for a dataset, we construct a 
description of the model and of the data in terms of the model; a good model is 
one leading to a concise total description, i.e., its total description length for 
the theory and data encoded under the theory is minimum. For a training dataset 
$D$ of $m$ classes, the set of EPs $E$ for all classes, 
$E = E_1 \cup E_2 \cup \ldots \cup E_m$, forms a model (theory) for the dataset. 
For a test instance $T$, the theory description length is the same, but the 
encoding length of $T$ under different class assumptions can be different. An EP of 
$E_j$, with its contrasting high support in $D_j$ and low support in $D_i$ 
($i \neq j$), is a message whose encoding cost is the smallest in $C_j$ among all 
$C_k$ ($1 \le k \le m$). We select representative EPs from the $m$ classes to 
encode $T$ (see § 3.2): 

$$\hat{E} = \bigcup_{j=1}^{m} \hat{E}_j, \qquad 
\hat{E}_j = \{X_k \in E_j \mid k = 1..p_j\} \text{ is a partition of } T$$ 

Note that we consider representative EPs of all classes to decide $T$'s label, 
and this ensures that our decision will not be influenced by differences between 
the EP sets $E_j$ ($j = 1..m$), such as the number and density of EPs. 

Following [11], if a message $X_k$ occurs with probability $P_k$, we postulate an 
encoding scheme where encoding $X_k$ costs $-\log_2(P_k)$ bits. The 
encoding length of $T$ under the assumption of class $C_i$ is defined to be the 
total cost of encoding the representative EPs $\hat{E}$ selected for $T$: 

$$L(T \| C_i) = -\sum_{k=1}^{p} \log_2 P(X_k | C_i), \qquad X_k \in \hat{E}$$ 

We assign $T$ the class label $C_i$ for which $L(T \| C_i)$ is the minimum. From a 
probabilistic point of view, we are "estimating" and ranking $P(T|C_i)$ by 
aggregating the EPs in a test instance. If $T$ indeed belongs to class $C_i$, 
$P(T|C_i)$ should be the maximum; since the estimate of $P(T|C_i)$ decreases 
monotonically as $L(T \| C_i)$ increases, $P(T|C_i)$ is the maximum when 
$L(T \| C_i)$ is the minimum. 
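
The decision rule above can be sketched in a few lines. The pattern names, the probability values and the helper names `encoding_length` and `classify` are hypothetical; the sketch assumes that every probability estimate is strictly positive, which is what the correction in § 2.1 is meant to provide.

```python
import math


def encoding_length(patterns, cond_prob):
    """Total cost in bits of encoding `patterns` given P(pattern | class)."""
    return -sum(math.log2(cond_prob[x]) for x in patterns)


def classify(representatives, probs):
    """Pick the class with the shortest encoding of the representative EPs.

    representatives: the selected EPs (from all classes) that partition T.
    probs[c][x]: estimate of P(x | c); must be > 0 for every x.
    """
    lengths = {c: encoding_length(representatives, p) for c, p in probs.items()}
    return min(lengths, key=lengths.get), lengths


# Hypothetical probability estimates for two representative EPs.
probs = {
    "Benign": {"e1": 0.5, "e2": 0.125},
    "Malignant": {"e1": 0.25, "e2": 0.5},
}
label, lengths = classify(["e1", "e2"], probs)
print(label, lengths)
```

In this toy case the instance costs 4 bits under the Benign assumption and 3 bits under the Malignant assumption, so it is assigned to Malignant.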

2.1 Estimating probability 

The support (observed frequency) of an itemset can be an unreliable substitute 
for its probability, especially when the data sets are small or when the training 
data contains noise. A typical example is a special type of EP: jumping 
EPs, with a growth rate of $\infty$ ($= \frac{>0}{0}$). To eliminate unreliable 
estimates and zero supports, we adopt standard statistical techniques to 
incorporate a small-sample correction into the observed frequency. In our 
experiments, given an itemset $X$, we approximate $P(X|C_i)$ by 
$\frac{\#(X \wedge C_i) + 2\,\frac{\#X}{\#D}}{\#C_i + 2}$, where 
$\#(X \wedge C_i)$ is the number of training instances belonging to class $C_i$ 
and containing $X$, $\#X$ is the total number of training instances containing 
$X$, $\#D$ is the total number of training instances, and $\#C_i$ is the number of 
training instances for class $C_i$. 
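
A sketch of such a small-sample correction, built from the four counts named above, is shown below. It assumes an m-estimate of the form (#(X and C_i) + m * #X / #D) / (#C_i + m) with m = 2; this form, the function name and the example counts are assumptions of the illustration.

```python
def estimate_prob(n_x_and_c, n_x, n_total, n_c, m=2):
    """Corrected estimate of P(X | C) from raw training counts.

    n_x_and_c: instances of class C containing X
    n_x:       instances (any class) containing X
    n_total:   all training instances
    n_c:       instances of class C
    m:         weight of the prior n_x / n_total (assumed to be 2 here)
    """
    prior = n_x / n_total
    return (n_x_and_c + m * prior) / (n_c + m)


# A jumping EP never seen in this class still gets a non-zero probability,
# so its encoding cost -log2(P) stays finite.
print(estimate_prob(0, 10, 100, 50))
print(estimate_prob(25, 50, 100, 50))
```

The correction pulls the observed frequency towards the overall frequency of the itemset, and its influence shrinks as the class size grows.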

3 Algorithms 

We describe here the algorithms for efficiently mining EPs at the learning stage 
and for finding representative EPs for classification at the classification stage. 

3.1 Constraint-based approach to mining EPs 

On large and high-dimensional datasets, a large number of long EPs appear, 
which challenges the Apriori [1] mining framework: the number of candidate EPs 
grows combinatorially. Given a target support threshold minsupp, a growth rate 
threshold minrate, and a growth-rate improvement threshold minrateimp (for an EP 
$e$, the growth-rate improvement of $e$, $rateimp(e)$, is 
$\min(\forall e' \subset e,\ GR(e) - GR(e'))$), ConsEPMiner [12] uses all 
constraints to effectively control the blow-up of candidate EPs and can 
successfully mine EPs at a low support threshold from large high-dimensional 
datasets; in particular, the growth-rate improvement threshold ensures a concise 
resulting EP set that represents all EPs. In iCAEP, we made the following 
extensions to ConsEPMiner for classification purposes: (1) In mining the EPs for 
class $C_i$, we compute their support in the target $D_i$, the background 
$D - D_i$, as well as all the other datasets $D_j$, $j \neq i$. (2) All single-item 
itemsets are in the EP set of each class, whether they satisfy the given thresholds 
or not. This ensures that we can always find a partition for an instance. 

3.2 Selecting representative EPs for classification 

Given a test instance $T$, a partition of $T$ is a set of itemsets 
$\{X_1, \ldots, X_p\}$, where $\bigcup_{k=1}^{p} X_k = T$ and 
$X_{k_1} \cap X_{k_2} = \emptyset$ ($k_1 \neq k_2$). From $E_j$, the set of EPs for 
class $C_j$, we select disjoint EPs to form the representative EP set $\hat{E}_j$ 
to encode $T$. Long itemsets with many items are obviously preferred, as they 
provide more information about the higher-order interactions between attributes. 
Among the itemsets of equal length, we prefer those with a larger growth rate. We 
organize $E_j$ in a hash tree. EPs are hashed first on their length and then on 
their first item. At each hash node, EPs are organized in decreasing order of their 
growth rate. We search the hash tree in decreasing order of length and decreasing 
order of growth rate. This ensures that we always find the most representative EPs 
of the relevant subtree for the items of $T$ that do not appear in any EP selected 
so far. The search finishes when a partition of $T$ is found. 

SelectEP(test instance T, the hash tree Tr for E_j) 
;; return the set of representative EPs from E_j to encode T 
ll ← |T|; F ← ∅; 
while T ≠ ∅ do 
    found ← 0; 
    for each item t ∈ T do 
        if ∃ X ⊆ T at the node Tr[ll][t] then 
            found ← 1; F ← F ∪ {X}; T ← T − X; break; 
    if found = 1 then ll ← the smaller of |T| and ll; 
    else ll ← ll − 1; 
return F; 

Fig. 1. Algorithm SelectEP 
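
The SelectEP procedure can be sketched in Python as follows. This is an illustrative reading of the figure, not the authors' implementation: the hash tree is approximated by a dictionary keyed on (length, first item), "first item" is taken to be the smallest item under the natural ordering, and the `found` flag is reset on every pass of the outer loop. The example EPs and growth rates are made up.

```python
def build_hash_tree(eps):
    """Index EPs by (length, first item); buckets sorted by growth rate.

    eps: iterable of (itemset, growth_rate) pairs.
    """
    tree = {}
    for itemset, rate in eps:
        key = (len(itemset), min(itemset))
        tree.setdefault(key, []).append((rate, frozenset(itemset)))
    for bucket in tree.values():
        bucket.sort(key=lambda pair: pair[0], reverse=True)
    return tree


def select_ep(instance, tree):
    """Greedily pick disjoint EPs contained in `instance`, longest first."""
    remaining = set(instance)
    length = len(remaining)
    selected = []
    # The length guard stops the search if no partition exists; with all
    # single-item itemsets present in the EP set it is never triggered.
    while remaining and length > 0:
        found = False
        for item in sorted(remaining):
            for _, itemset in tree.get((length, item), []):
                if itemset <= remaining:
                    selected.append(itemset)
                    remaining -= itemset
                    found = True
                    break
            if found:
                break
        length = min(length, len(remaining)) if found else length - 1
    return selected


# Hypothetical EP set for one class; items are plain strings here.
eps = [({"a", "b"}, 5.0), ({"b", "c"}, 9.0),
       ({"a"}, 1.0), ({"b"}, 1.0), ({"c"}, 2.0), ({"d"}, 1.5)]
tree = build_hash_tree(eps)
selected = select_ep({"a", "b", "c", "d"}, tree)
print(selected)
```

As in the figure, the first item that yields a match at the current length is taken, so {a, b} is selected before {b, c} is ever examined even though {b, c} has the larger growth rate.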

4 Experiments 

We use 24 UCI datasets to evaluate iCAEP. We compare it with CAEP [3], the 
first EP-based classifier; Naive Bayes (NB), a surprisingly successful classifier 
compared with more complicated classifiers; C4.5, the widely known decision 
tree classifier; and two recent classifiers from the data mining community: LB [8], 
an extension to NB based on frequent itemsets, and CBA [6], a classifier using 
association rules [1] for classification. Entropy-based discretization [4] is used 
to discretize continuous attributes, where the code is taken from the MLC++ 
machine learning library.^3 

Table 1 describes the datasets and presents the accuracy of the different 
classifiers. The accuracies of iCAEP, CAEP and NB are obtained on exactly the same 
10-fold cross validation (CV-10) data. Note that the accuracies of iCAEP and 
CAEP are obtained without fine-tuning any parameters, whereas those of CBA 
and LB are the CV-10 results reported in the literature. Numbers in bold are 
the best accuracy for each dataset and ‘ — ’ indicates that the result is not 
available in the literature. In the CV-10 experiments of CAEP and iCAEP, ConsEPMiner 

¹ http://www.sgi.com/Technology/mlc/
Table 1. Description of datasets and summary of accuracy

                             Dataset Properties     Accuracy (%)
Dataset                      Size    #Cls  #Attr    iCAEP  CAEP   NB     C4.5   CBA    LB
Adult                        45,225  2     14       80.88  83.09  83.81  85.4   75.21  —
Annealing process            898     6     38       95.06  85.74  96.99  90.4   98.1   —
Australian credit approval   690     2     14       86.09  78.55  85.51  84.28  85.51  85.65
Breast cancer (Wisc.)        699     2     9        97.42  97.00  97.14  95.42  95.28  96.86
Chess (k_rook-k_pawn)        3,169   2     36       94.59  85.45  87.92  99.5   98.12  90.24
Connect-4                    67,557  3     42       69.90  72.97  72.13  —      —      —
German                       1,000   2     20       73.10  73.30  74.40  71.7   73.2   74.8
Haberman                     306     2     3        71.52  71.52  69.91  —      —      —
Heart disease (Cleve.)       303     2     13       80.25  82.52  82.87  78.4   81.87  82.22
Hepatitis prognosis          155     2     19       83.33  81.96  84.62  81.6   80.2   84.5
Hypothyroid diagnosis        3,163   2     25       96.40  96.49  98.42  98.8   98.4   —
Ionosphere                   351     2     34       90.60  87.21  89.45  92     92.1   —
Iris                         150     3     4        93.33  94.00  93.33  94.7   92.9   —
Labor                        57      2     16       89.67  79.33  86.33  79     83     —
Lymphography                 148     4     18       79.76  74.38  78.33  78.39  77.33  84.57
Mushroom                     8,124   2     22       99.81  93.04  99.68  —      —      —
Nursery                      12,961  5     8        84.66  84.37  90.28  —      —      —
Pima                         768     2     8        72.27  73.30  74.74  72.5   73.1   75.77
Solar flare (X class)        1,388   3     10       92.00  89.34  96.32  84.4   —      —
Spambase                     4,601   2     57       91.18  86.42  89.87  —      —      —
Tic-tac-toe                  958     2     9        92.06  85.91  70.15  86.3   100    —
Vehicle                      846     4     18       62.76  55.92  59.57  69.82  68.78  68.8
Waveform                     5,000   3     21       81.68  83.92  80.76  70.4   75.34  79.43
Wine                         178     3     13       98.89  96.08  89.90  87.9   91.6   —

is employed to mine EPs, with the following settings: minsupp = 1% or a count 
of 5, whichever is larger, minrate = 5, minrateimp = 0.01. We also limit the 
size of EP set of each class to 100,000 EPs. At the classification stage, the base 
normalization score for CAEP is set to 85%. We can draw several conclusions 
from the table: (1) In terms of the best accuracy for each dataset, iCAEP wins
on 7 datasets, whereas C4.5, NB, CAEP, CBA and LB win on 5, 4, 3, 3 and 3
datasets respectively. (2) In terms of overall performance, the average accuracies
of iCAEP, CAEP and NB are 85.72%, 82.99%, and 84.68%; iCAEP is the best.
(3) With ConsEPMiner, both iCAEP and CAEP can successfully classify all 
datasets, including the challenging large-volume high-dimensional datasets like 
Connect-4 (67,557 instances, 42 attributes, 126 items) and Spambase (4,601 in- 
stances, 57 attributes, 152 items). Compared with CAEP, at the training stage, 
iCAEP saves the time for calculating the base score for normalization; at the 
classification stage, the new algorithm for selecting representative EPs reduces 
classification time by 50%. 

5 Related Work 

The closest related work is CAEP, which is also based on the idea of aggre- 
gating EPs for classification. In iCAEP, (1) we aggregate EPs in a different 
approach, the information-based approach; (2) with the constraint-based EP 
mining algorithm, we can build classifiers more efficiently and can handle large 
high-dimensional datasets more effectively; (3) when classifying an instance T, 
we select a smaller but more representative subset of EPs presented in T. Experi- 
ments show that, compared to CAEP, iCAEP has better classification accuracy, 
and shorter time for training and classification. The current experiments are 
based on default parameter settings for both algorithms; it would be interesting
to further compare their performance after both are fine-tuned. 

Jumping EP-based classifiers [5] consider only jumping EPs for classification. 
Considering finite growth rate EPs and jumping EPs, and treating jumping EPs 
(§ 2.1) more carefully, iCAEP will behave well in the presence of noise. In large 
and high-dimensional datasets, a huge number of long jumping EPs are present 
and this brings difficulty to jumping EP-based classifiers. 

Clearly, iCAEP is different from decision tree classifiers, Bayesian family 
classifiers, or the association classifier CBA. Specifically, CBA and LB, the two 
classifiers from the data mining community, construct classifiers based on the 
Apriori framework, which is usually too slow to be useful for real-world large 
high-dimensional data [12]; CBA and LB use frequent itemsets for classification 
in a completely sequential way. More generally, the MML or MDL principle has 
been proved successful in many applications. 

6 Conclusions 

We have presented a classifier iCAEP, based on Emerging Patterns (EPs). iCAEP 
classifies an instance T by aggregating the EPs learned from training data that 
appear in T in an information-based approach. Experiments on many datasets 
show that, compared to other classifiers, iCAEP achieves better classification 
accuracy, and it scales well for large-volume high-dimensional datasets. In our
future work, we will focus on tuning iCAEP for higher accuracy. 
Abstract. The paper considers applying a boosting strategy to optimise
the generalisation bound obtained recently by Shawe-Taylor and
Cristianini [7] in terms of the two norm of the slack variables. The
formulation performs gradient descent over the quadratic loss function which
is insensitive to points with a large margin. A novel feature of this
algorithm is a principled adaptation of the size of the target margin.
Experiments with text and UCI data show that the new algorithm improves
the accuracy of boosting. DMarginBoost generally achieves significant
improvements over AdaBoost.

1 Introduction 

During the last decade new learning methods, the so-called ensemble methods, have
gained much attention in the machine learning community. These methods generally
produce a classifier with a high learning accuracy. It has recently been established
that boosting can be viewed as gradient descent in the function space based on
a criterion derived from the margins of the training examples [2], [3].

The standard boosting algorithm AdaBoost optimises a negative exponential
function of the margin that corresponds most closely to the hard margin
criterion for Support Vector Machines. This idea generalises well in a low noise
environment but fails to perform well in noisy environments. This opens up the
need for developing new boosting techniques which are robust to noise [3], [6].
For SVM the generalisation results are improved by optimising the 2-norm of 
the slack variables in the corresponding optimisation problem [5]. This approach 
has recently been placed on firm footing by Shawe-Taylor and Cristianini [7]. 

Mason et al. [3] give a general formulation of how alternative loss functions 
give rise to a general boosting strategy. This paper adopts this approach with one 
additional feature: the loss function used depends on the target margin since it 
is relative to this margin that the slack variables are calculated. Since fixing the 
size of the target margin is difficult a priori, we introduce a method of adapting 
its value in response to the progress of the boosting algorithm. This paper adapts
the new generalisation bound for boosting. It proposes a new algorithm, develops
a new loss function and shows that the strategy of performing gradient descent
in function space relative to an adapting loss function provides a highly accurate 
classifier which is robust to noise. 
2 2-norm Soft Margin Bound for Boosting



The function classes that we will be considering are of the form $\mathrm{co}(H) =
\{\sum_{h \in H} a_h h\}$, where H is a set of weak learners. We first introduce some
notation. If $D$ is a distribution on inputs and targets, $X \times \{-1, 1\}$, we define the error
$\mathrm{err}_D(f)$ of a function $f \in \mathcal{F}$ to be the probability $D\{(x, y) : \mathrm{sgn}(f(x)) \neq y\}$,
where we have assumed that we obtain a classification function by thresholding
at 0, if $f$ is real-valued.

For the background theory required for the derivation of the bound see
Definition 2.1, Theorem 2.1, Definition 2.2, Definition 2.3 and Theorem 2.2 in [1]. We
apply these results to our function class, which is of the form described
above, $\mathcal{F} = \mathrm{co}(H) = \{\sum_{h \in H} a_h h\}$, where we have left open for the time being
what the class H of weak learners might contain. The sets $\mathcal{G}$ of Theorem 2.2
in [1] will be chosen as follows: $\mathcal{G}_B = \{(\sum_{h \in H} a_h h, g) : \sum_{h \in H} |a_h| + \|g\|_2 \le B\}$.

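For two classes $\mathcal{G}_1$ and $\mathcal{G}_2$ of real valued functions, we denote by $\mathcal{G}_1 + \mathcal{G}_2$ the
class $\mathcal{G}_1 + \mathcal{G}_2 = \{f_1 + f_2 : f_1 \in \mathcal{G}_1, f_2 \in \mathcal{G}_2\}$. Furthermore, by taking the sums
of functions from coverings of $\mathcal{G}_1$ and $\mathcal{G}_2$ at scales $\eta$ and $\gamma - \eta$ respectively, we
obtain a covering of $\mathcal{G}_1 + \mathcal{G}_2$ at scale $\gamma$.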

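If we let $\mathcal{F}_B$ be the class $\mathcal{F}_B = \{\sum_{h \in H} a_h h : \sum_{h \in H} |a_h| \le B\}$ and the class $\mathcal{H}_B$ be
defined as $\mathcal{H}_B = \{g : \|g\|_2 \le B\}$, then clearly we have $\mathcal{G}_B \subseteq \mathcal{F}_B + \mathcal{H}_B$.
Hence, we can obtain a bound on the covering numbers for $\mathcal{G}_B$ as

$$\mathcal{N}(\mathcal{G}_B, m, \gamma) \le \mathcal{N}(\mathcal{F}_B, m, \eta)\, \mathcal{N}(\mathcal{H}_B, m, \gamma - \eta).$$

We can therefore bound the covering numbers of $\mathcal{G}_B$ by bounding those of $\mathcal{F}_B$
and $\mathcal{H}_B$. The techniques presented in [1] can be used to obtain the following
bound by [8].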



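Theorem 1. For the class $\mathcal{F}_B$ defined above we have that

$$\log \mathcal{N}(\mathcal{F}_B, m, \gamma) \le 1 + \frac{144 B^2}{\gamma^2}\,\bigl(2 + \ln(B_H(m))\bigr)\,\log\left(2 \left\lceil \frac{4B}{\gamma} + 2 \right\rceil m + 1\right),$$

where $B_H(m)$ is the maximum number of dichotomies that can be realised by H
on m points.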

The covering numbers of the second set can be obtained by applying the result in [8].

Theorem 2. For the class $\mathcal{H}_B$ defined above we have that

$$\log \mathcal{N}(\mathcal{H}_B, m, \gamma) \le \frac{36 B^2}{\gamma^2} \log\left(2 \ldots\right.$$

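Hence, the optimisation of the generalisation bound will be done if we minimise
$\log \mathcal{N}(\mathcal{G}_B, m, \gamma)$, which is achieved by minimising $\frac{144 B^2}{\eta^2} + \frac{36 B^2}{(\gamma - \eta)^2}$. Taking $\eta = 2\gamma/3$
gives $\frac{648 B^2}{\gamma^2} = \frac{648 (A + D(\gamma))^2}{\gamma^2}$, where $D(\gamma) = \sqrt{\sum_i \xi_i^2}$ with slack variables
$\xi_i = (\gamma - y_i f(x_i))_+$, and $A = \sum_{h \in H} |a_h|$. Minimisation of this quantity will motivate the DMarginBoost
algorithm where we will use the parameter C in place of 1/….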
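
The arithmetic behind the choice $\eta = 2\gamma/3$ can be checked directly. The worked step below assumes, as the two theorems suggest, that the dominant terms of the two covering-number bounds are $144B^2/\eta^2$ and $36B^2/(\gamma-\eta)^2$; the logarithmic factors are ignored.

```latex
\begin{align*}
\left.\frac{144B^2}{\eta^2}\right|_{\eta = 2\gamma/3}
  &= \frac{144 \cdot 9}{4}\,\frac{B^2}{\gamma^2} = \frac{324B^2}{\gamma^2}, \\
\left.\frac{36B^2}{(\gamma-\eta)^2}\right|_{\eta = 2\gamma/3}
  &= \frac{36B^2}{(\gamma/3)^2} = \frac{324B^2}{\gamma^2}, \\
\frac{144B^2}{\eta^2} + \frac{36B^2}{(\gamma-\eta)^2}
  &= \frac{648B^2}{\gamma^2}.
\end{align*}
```

With this choice the two terms contribute equally, which is where the constant 648 comes from.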
3 The DMarginBoost Algorithm



The DMarginBoost algorithm applies the idea of minimising the target margin
based loss function. The algorithm takes a set of p labelled examples. It gets a
base classifier and assigns weight α to it. The DMarginBoost algorithm handles
noisy data by assigning quadratic loss to the examples that fail to meet the
target margin. The examples with margin greater than the target margin incur
no loss. The target margin is updated to minimise the current estimate of the error.
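Derivation of target margin γ: To find the target margin γ at each iteration
we minimise the error bound derived in the previous section. In other words we
minimise the quantity

$$\gamma_t = \arg\min_{\gamma} \left[ \frac{C A_t^2}{\gamma^2} + \frac{1}{\gamma^2} \sum_i \bigl( (\gamma - y_i f_{t-1}(x_i))_+ \bigr)^2 \right],$$

where $A_t = \sum_h |a_h|$ is the sum of the absolute coefficients of $f_{t-1}$, $(z)_+ = \max(z, 0)$,
and C is the tradeoff parameter between error and
maximising the margin. Optimal C gives the best choice of error/margin tradeoff.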

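Let $B = \frac{1}{\gamma}$ and, defining $S_t(B) = \{i : B\, y_i f_{t-1}(x_i) < 1\}$, we wish to minimise

$$C A_t^2 B^2 + \sum_i \bigl( (1 - B\, y_i f_{t-1}(x_i))_+ \bigr)^2 = C A_t^2 B^2 + \sum_{i \in S_t(B)} (1 - B\, y_i f_{t-1}(x_i))^2 .$$

Taking the derivative of the above expression with respect to B and setting it equal to zero,

$$2 C B A_t^2 + 2 \sum_{i \in S_t(B)} (1 - B\, y_i f_{t-1}(x_i)) (-y_i f_{t-1}(x_i)) = 0,$$

which, using $y_i^2 = 1$, gives

$$B = \frac{\sum_{i \in S_t(B)} y_i f_{t-1}(x_i)}{C A_t^2 + \sum_{i \in S_t(B)} (f_{t-1}(x_i))^2}.$$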

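We can calculate the value of γ from the chosen B. For t = 1, with $f_1 = h_1$, $A_1 = 1$ and
all examples in $S_t(B)$, the expression above reduces to $B = \frac{2c - p}{C + p}$, where c is
the number of correctly classified examples and p the total number of examples.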
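
Because the set $S_t(B)$ itself depends on B, the closed-form expression has to be solved self-consistently. The paper does not say how this is done; the sketch below assumes a simple fixed-point iteration that starts with all examples active and stops when the active set no longer changes. The function name `inverse_target_margin`, the argument names and the iteration cap are hypothetical, and C and A are assumed to be strictly positive.

```python
def inverse_target_margin(margins, C, A, max_iter=100):
    """Solve for B = 1/gamma by iterating on the active set S_t(B).

    margins[i] holds y_i * f_{t-1}(x_i); C and A are assumed positive.
    This fixed-point scheme is an illustration, not the authors' procedure.
    """
    active = list(range(len(margins)))  # B = 0 makes every example active
    B = 0.0
    for _ in range(max_iter):
        numerator = sum(margins[i] for i in active)
        denominator = C * A * A + sum(margins[i] ** 2 for i in active)
        B = numerator / denominator
        new_active = [i for i, m in enumerate(margins) if B * m < 1.0]
        if new_active == active:
            break  # the active set is consistent with B
        active = new_active
    return B
```

When the returned B is positive, the target margin is recovered as γ = 1/B; a non-positive B means the current combination misclassifies too much weight for a positive margin to be meaningful, and a caller would need to handle that case separately.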
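Derivation of α: At each iteration t, to calculate the value of α we define
$S_t(\alpha) = \{i : y_i(f_{t-1}(x_i) + \alpha h_t(x_i)) < \gamma_t\}$. We now have

$$F(\alpha) = \sum_{i \in S_t(\alpha)} \bigl( \gamma_t - y_i(f_{t-1}(x_i) + \alpha h_t(x_i)) \bigr)^2 .$$

Taking the derivative with respect to α and setting it equal to zero,

$$\frac{dF}{d\alpha} = -2 \sum_{i \in S_t(\alpha)} \bigl( \gamma_t - y_i(f_{t-1}(x_i) + \alpha h_t(x_i)) \bigr)\, y_i h_t(x_i) = 0,$$

$$\alpha = \frac{\sum_{i \in S_t(\alpha)} y_i h_t(x_i)\, (\gamma_t - y_i f_{t-1}(x_i))}{\sum_{i \in S_t(\alpha)} (h_t(x_i))^2}.$$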
Algorithm DMarginBoost

Require:
    Training set: $(x_1, y_1), (x_2, y_2), \ldots, (x_p, y_p)$ where $x_i \in X$ and $y_i \in \{-1, +1\}$.
    A base learner that takes a training set and a distribution and generates a base classifier.
Initialize: Let $D_1(i) = \frac{1}{p}$ for i = 1 to p and $\alpha_1 = 1$
Call base learner that generates base classifier $h_1$ and get $f_1(x_i) = \alpha_1 h_1(x_i)$
for t = 2 to T do
    Calculate desired margin $\gamma_t$
    Update distribution: $D_t(i) = \frac{(\gamma_t - y_i f_{t-1}(x_i))_+}{Z_t}$, where $Z_t$ is a normalisation factor
    Call base learner to get base classifier $h_t$
    Calculate $\alpha_t$ to minimize $\sum_i \bigl( (\gamma_t - y_i(f_{t-1}(x_i) + \alpha h_t(x_i)))_+ \bigr)^2$
    Update: $f_t(x_i) = f_{t-1}(x_i) + \alpha_t h_t(x_i)$
end for
return $f_T(x)$

In order to solve this we must find the critical values of α for which the set
$S_t(\alpha)$ changes, evaluate F at each of these, and then apply the analysis to each
interval between consecutive critical values. If the solution for an interval lies in
that interval, it becomes a candidate solution. Hence
there are at most 2m candidates from which the optimal one must be chosen.
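
The interval search described above can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' code: the names `margin_loss` and `best_alpha` are hypothetical, `margins[i]` stands for $y_i f_{t-1}(x_i)$ and `yh[i]` for $y_i h_t(x_i)$, and the choice of probing each interval at an interior point to fix its active set is an assumption.

```python
import math


def margin_loss(alpha, margins, yh, gamma):
    """F(alpha): squared shortfall of each example below the target margin."""
    return sum(max(0.0, gamma - (m + alpha * z)) ** 2 for m, z in zip(margins, yh))


def best_alpha(margins, yh, gamma):
    """Enumerate candidate minimisers of F over the intervals between critical values."""
    # Critical values: where an example's margin crosses gamma exactly.
    crit = sorted({(gamma - m) / z for m, z in zip(margins, yh) if z != 0})
    if not crit:
        return 0.0
    bounds = [(-math.inf, crit[0])] + list(zip(crit, crit[1:])) + [(crit[-1], math.inf)]
    probes = [crit[0] - 1.0] + [(a + b) / 2 for a, b in zip(crit, crit[1:])] + [crit[-1] + 1.0]
    candidates = list(crit)
    for probe, (lo, hi) in zip(probes, bounds):
        # The active set is constant inside an interval, so one probe fixes it.
        active = [(m, z) for m, z in zip(margins, yh) if m + probe * z < gamma]
        denominator = sum(z * z for _, z in active)
        if denominator == 0:
            continue  # F is constant on this interval; its endpoints are already candidates
        alpha = sum(z * (gamma - m) for m, z in active) / denominator
        if lo <= alpha <= hi:
            candidates.append(alpha)
    return min(candidates, key=lambda a: margin_loss(a, margins, yh, gamma))
```

Each interval contributes at most one stationary point and each critical value is itself a candidate, which matches the bound on the number of candidates stated in the text.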

4 Experiments 

To evaluate the performance of the DMarginBoost algorithm, we performed a series
of experiments on two different domains of data.

Soft Margin Based Text Booster: The first set of experiments evaluates 

DMarginBoost on a text categorisation task. We used terms (single words or
phrases, i.e. adjacent words) as base learners following [4]. At each iteration t, for
each term, the corresponding weak learner classifies a document as relevant if it
contains that term. In the next stage, the error of each term is calculated with
respect to the distribution $D_t$. The error is given by the total weight
$\sum_{i : h(x_i) \neq y_i} D_t(i)$ of the misclassified documents. The
terms with minimum and maximum error are considered. The negation of the
term with maximum error is a candidate since we can also use the negation as a
weak learner by taking a negative coefficient $\alpha_t$. The selected term is the better
discriminator of relevance of the two. Finally, all the documents containing the 
selected term are classified relevant and others irrelevant. Initial experiments 
showed no significant difference between words and phrases and only words. 
Therefore experiments have been performed using only words. 
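
The term selection step described above can be sketched as follows. This is only an illustration under stated assumptions: documents are represented as sets of words, the function name `pick_term` is hypothetical, and the error of a term is taken to be the total weight of the documents it misclassifies, so that the negated term has error equal to the total weight minus that value.

```python
def pick_term(docs, labels, weights, vocabulary):
    """Return (term, sign): sign +1 uses the term itself, -1 uses its negation.

    docs: list of sets of words; labels: +1 (relevant) or -1; weights: D_t.
    """
    def weighted_error(term):
        # The weak learner predicts +1 exactly when the document contains the term.
        return sum(w for doc, y, w in zip(docs, labels, weights)
                   if (1 if term in doc else -1) != y)

    errors = {term: weighted_error(term) for term in vocabulary}
    lowest = min(errors, key=errors.get)
    highest = max(errors, key=errors.get)
    total = sum(weights)
    # The negation of the worst term misclassifies exactly the remaining weight.
    if errors[lowest] <= total - errors[highest]:
        return lowest, +1
    return highest, -1
```

Returning a sign rather than a separate negated learner reflects the remark in the text that the negation can be realised through a negative coefficient.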

Boosting strategy for Reuters: We evaluated the boosting algorithms on
the Reuters-21578 data set compiled by David Lewis. The "ModApte" split,
which contains 9603 training and 3299 test documents, was used. Punctuation
and stop words were removed from the documents. For DMarginBoost the
optimal values of free parameters were set on a validation set. We selected a subset
Table 1. F1 numbers and Breakeven points for ten largest categories

            AdaBoost(F1)  DMarginBoost(F1)  DMarginBoost(B)  SVM(1)  SVM(2)
earn        0.976         0.977             0.977            0.982   0.98
acq         0.906         0.926             0.926            0.926   0.936
money-fx    0.687         0.722             0.725            0.669   0.745
grain       0.898         0.904             0.905            0.913   0.946
crude       0.880         0.879             0.879            0.86    0.889
trade       0.669         0.763             0.763            0.692   0.759
interest    0.570         0.722             0.728            0.698   0.777
ship        0.774         0.838             0.838            0.820   0.856
wheat       0.878         0.909             0.915            0.831   0.918
corn        0.897         0.908             0.911            0.86    0.903

of 6723 documents for the training set and 2880 documents for the validation set.
DMarginBoost was run for C = O.f, 1.5, 2.5, 4, 8. The optimal C and corresponding
T which give minimum error on the validation set were selected. For the chosen
values of parameters, DMarginBoost was run using the whole training set of
9603 documents and its performance was evaluated on 3299 test documents. For
AdaBoost the number of iterations T was selected as explained in [4].

Results: For evaluation, F1 measures and the Breakeven point (see [4]) were used.
Figure 1, Figure 2 and Figure 3 demonstrate how the performance of the
solution improves with boosting iterations for category 'acq'. Table 1 compares the
Breakeven points and F1 numbers of DMarginBoost to AdaBoost, SVM(1) and
SVM(2). Results of SVM(1) and SVM(2) have been taken from [9], [10]. The
results indicate that in 9 cases the DMarginBoost algorithm gave better results
than AdaBoost, in 7 cases it is better than SVM(1), in 1 case it is equal
to SVM(1), and in 2 cases it is also better than SVM(2). The results show that
DMarginBoost generally outperforms AdaBoost.




Fig. 1. F1    Fig. 2. Precision    Fig. 3. Recall



Boosted Decision Trees: We used C4.5 as the base learner for our second set of
experiments. We selected the Ionosphere and Pima-Indians datasets from the UCI
repository. Ten random splits of the data were used, taking 90% for training
and 10% for testing. We fixed the values C = 1.0 and T = 100. The performance
of DMarginBoost in noisy environments was investigated by introducing
5% random label noise (in parentheses). DOOM2 on Ionosphere gave errors
of 9.7% and AdaBoost 10.1% with decision stumps as base learners (see [3]).
Table 2 shows that the improvement of DMarginBoost (even with C = 1) over
AdaBoost is larger than the improvement of DOOM2 over AdaBoost in 3 cases.

Table 2. Average Test Set Error for Ionosphere and Pima-Indians

               Examples  Features  AdaBoost  DMarginBoost  AdaBoost  DMarginBoost
Ionosphere     351       34        6.9(0)    6.3(0)        13.4(5)   11.9(5)
Pima-Indians   768       8         25.0(0)   24.4(0)       29.8(5)   28.1(5)
5 Conclusion 

The paper has developed a novel boosting algorithm. The algorithm optimises
the size of the target margin by minimising the error bound. We presented
experiments in which the algorithm was compared with AdaBoost on a set of
categories from the Reuters-21578 data set and data sets from the UCI repository. The results
were very encouraging, showing that the new algorithm generally outperforms
AdaBoost in both noisy and noise-free environments.
Abstract. It is generally accepted that pre-processing for data mining
is needed to exclude irrelevant and meaningless aspects of data before
applying data mining algorithms. From this viewpoint, we have already
proposed a notion of Information Theoretical Abstraction, and
implemented a system ITA. Given a relational database and a family of
possible abstractions for its attribute values, called an abstraction hierarchy,
ITA selects the best abstraction among the possible ones so that the class
distributions needed to perform our classification task are preserved as
far as possible. According to our previous experiment, just one
application of abstraction for the whole database has shown its effectiveness
in reducing the size of detected rules, without making the classification
error worse. However, as C4.5 performs serial attribute-selection
repeatedly, ITA does not generally guarantee the preservation of class
distributions, given a sequence of attribute-selections. For this reason, in this
paper, we propose a new version of ITA, called iterative ITA, so that it
tries to keep the class distributions in each attribute selection step as
far as possible.



1 Introduction 

Many studies on data mining have concentrated on developing methods for
extracting useful knowledge from very large databases effectively. However, the rules
detected by those methods include meaningless rules as well as meaningful
ones. Thus, pre-processing is also important to exclude irrelevant aspects of
data. There exist some techniques commonly used in pre-processing [1,3,
11]. For instance, feature selection methods focus on a particular subset of
attributes relevant to the aim of the mining task. Furthermore, generalization of
databases is also a powerful technique not only for preventing the mining task
from extracting meaningless rules but also for making the detected rules more
understandable. The attribute-oriented induction used in DBMiner [5] and a method
in INLEN [12] for learning rules with structured attributes are typical instances
of the generalization method.

We consider in this paper that such a generalization method is an instance
of the abstraction strategy [6] shown in Figure 1.1. In general, if it is difficult to find
a solution S for a problem P at the concrete level, we transform P into an
abstract problem abs(P). Furthermore we find an abstract solution abs(S) from
K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 60-70, 2000. 

© Springer-Verlag Berlin Heidelberg 2000 
Fig. 1.1. An abstraction strategy in our method



abs(P) and transform abs(S) into S. DBMiner and INLEN are also considered
to be based on this abstraction strategy.

We are particularly interested in the problem of determining an appropriate
abstraction among possible ones. This is because the abstraction strategy is
actually dependent on the choice of abstractions. In the case of the attribute-oriented
induction algorithm in DBMiner, given a concept hierarchy with a single
inheritance structure, it can decide an adequate abstract level based on a threshold of
the number of attribute values. However, since the hierarchy provides only a single
inheritance, the abstraction may lose some important aspect of data not
covered by the hierarchy. The generalization process in INLEN uses anchor nodes,
defined as concepts based on a typicality notion, to select the best abstraction.
However it is generally hard for a user to define such an anchor node before the
mining task. Consequently, INLEN also may miss important aspects of data
needed to perform our mining task.

In the previous work [7], we have already proposed a notion of Information
Theoretical Abstraction (ITA) to overcome these problems, particularly on
classification tasks. ITA is an extension of the attribute-oriented induction. Compared
with the attribute-oriented induction, ITA can automatically select the best
abstraction by minimizing the loss of information, where the information needed
for our classification is the class distribution. Furthermore, assuming that the
anchor nodes used in INLEN correspond to our appropriate abstractions, ITA
can also automatically decide the anchor nodes.

ITA and iterative ITA, introduced in this paper, generalize databases
according to the abstraction strategy in Figure 1.1 in the following sense. Firstly,
the original database D is generalized to an abstract database D' by
replacing attribute values in D with corresponding abstract values determined by an
appropriate abstraction abs. Secondly, it detects a compact decision tree DT',
called an abstract decision tree, from D'. Finally, we interpret DT' and transform
DT' into DT.

In principle, given target classes, ITA automatically selects an appropriate
abstraction in abstraction hierarchies with multiple inheritances and generalizes
the database based on the selected abstraction. The abstraction is said to be
appropriate if, for a given target attribute, class distributions in the original
database are preserved in the resultant of generalization as far as possible.
[Figure labels: replacing attribute values with abstract concepts based on an
appropriate abstraction; selecting an appropriate abstraction in each attribute
selection step; an abstract decision tree.]

Fig. 1.2. An effect of the abstraction for the decision tree



This is because classification systems such as C4.5 construct the decision tree based on
the class distribution. If the attribute values share the same or a similar class
distribution, they can be considered not to have significant differences about the
class. So they can be abstracted to a single value at the abstract level. On the other
hand, when the values have distinguishable class distributions, the difference will
be significant to perform classification in terms of attribute values. Hence, the
difference should not be disregarded in the abstraction.

We have shown that the same measure as used in C4.5, the information gain ratio,
can be adopted to measure the difference of class distributions [7]. It has
already been shown empirically that the classification error of the abstract decision
tree DT' is almost the same as that of the decision tree DT directly computed by C4.5
from the original database D. Nevertheless, the size of DT' is drastically reduced
compared with DT.

Thus, just one application of abstraction for the whole database has been
shown experimentally to be effective in reducing the size of detected rules, without
making the classification accuracy much worse. However, as C4.5 performs serial
attribute selections repeatedly, ITA does not generally guarantee the
preservation of class distributions in each selection step. Hence, when we require that
the classification accuracy of DT' must be almost equal to that of DT, we cannot
allow the classification accuracy to go down even slightly. For this reason, in this
paper, we propose a new version of ITA, called iterative ITA, so that it tries to
keep the class distribution in each attribute selection step as far as possible.

Figure 1.2 illustrates a generalization process in the iterative ITA. Iterative
ITA selects an appropriate abstraction in each attribute selection step and
constructs a compact decision tree, called an abstract decision tree. That is, we
propose to perform our generalization process in each attribute-selection step in
C4.5, where an attribute is selected and the decision tree is expanded based on it. Each node
$N_i$ in a tree has the corresponding sub-database $D_{N_i}$ of the original D, where
the sub-database is the set of instances obtained by selecting tuples. For such a
sub-database $D_{N_i}$, C4.5 selects another attribute $A_i$ to further expand the
tree. We try to find an appropriate abstraction $\varphi$ for that $A_i$ so that the target
class distribution given $A_i$ values can be preserved even after generalizing the $A_i$
values $v$ to more abstract values $\varphi(v)$. The generalized database is also denoted
by $D'_{N_i} = \varphi(D_{N_i})$.

At the abstract level, $D'_{N_i}$ is divided according to the abstract $A_i$ values
$\varphi(v)$, similarly to the concrete level division by the concrete $A_i$ values $v$. It
should be noted that the number of new branches of $N_i$ is less than (or equal to)
the number obtained by dividing $D_{N_i}$, because the attribute values in $D_{N_i}$ are merged
into abstract ones in $D'_{N_i}$. In addition, since the best abstraction is chosen in
each step, the abstract level class distribution is the closest one to the concrete
level distribution. In other words, the loss of precision is minimized in each
step of attribute selection among the possible abstractions. As a result, the
precision of detected rules will become much closer to that of C4.5, while keeping
the same property of reducing the size of detected rules as non-iterative ITA.

This paper is organized as follows. In section 2, we present a brief summary 
of ITA. Section 3 describes the iterative ITA and its principle. In section 4, we 
evaluate iterative ITA with some experiments on census database in US Census 
Bureau. Section 5 concludes this paper. 

2 Preliminary 

Our system ITA [7] was developed based on the idea that the behavior of the
information gain ratio used in C4.5 [16] applies to the generalization process in
the attribute-oriented induction [5].



2.1 Information gain ratio 

Let a data set S be a set of instances of a relational schema $R(A_1, \ldots, A_m)$, where
$A_k$ is an attribute. Furthermore we assume that the user specifies an assignment C
of class information to each tuple in S. We regard it as a random variable with
the probability $\Pr(C = c_i) = freq(c_i, S)/|S|$, where $|S|$ is the cardinality of S,
and $freq(c_i, S)$ denotes the number of all tuples in S whose class is $c_i$. Then the
entropy of the class distribution $(\Pr(C = c_1), \ldots, \Pr(C = c_n))$ over S is given by

$$H(C) = -\sum_{i=1}^{n} \Pr(C = c_i) \log_2 \Pr(C = c_i).$$

Now, given an attribute value $a_j$ of $A = \{a_1, \ldots, a_\ell\}$, we obtain a posterior
class distribution $(\Pr(C = c_1 | A = a_j), \ldots, \Pr(C = c_n | A = a_j))$ that has the
corresponding entropy

$$H(C | A = a_j) = -\sum_{i=1}^{n} \Pr(C = c_i | A = a_j) \log_2 \Pr(C = c_i | A = a_j).$$

The expectation of these entropies of posterior class distributions
is called a conditional entropy

$$H(C | A) = \sum_{j=1}^{\ell} \Pr(A = a_j)\, H(C | A = a_j).$$

The subtraction of $H(C|A)$ from $H(C)$ gives an information gain, which is also called
a mutual information $I(C; A)$.

$$gain(A, S) = H(C) - H(C|A) = I(C; A). \qquad (2.1)$$

To normalize the information gain, the entropy of an attribute $A = \{a_1, \ldots, a_\ell\}$,
called a split information, is used.

$$split\_info(A, S) = -\sum_{j=1}^{\ell} \Pr(A = a_j) \log_2 \Pr(A = a_j) = H(A). \qquad (2.2)$$

Finally, the information gain is divided by the split information. The normalized
information gain is called an information gain ratio

$$gain\_ratio(A, S) = gain(A, S) / split\_info(A, S) = I(C; A)/H(A). \qquad (2.3)$$
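
Equations (2.1) to (2.3) translate directly into a few lines of Python. The sketch below is only an illustration: the function names are hypothetical, an attribute and the class are each given as a list with one entry per tuple, and the gain ratio is defined as zero when the split information is zero, which is a convention adopted here rather than one stated in the text.

```python
from collections import Counter
from math import log2


def entropy(values):
    """H(X) for the empirical distribution of a list of values."""
    n = len(values)
    return -sum((count / n) * log2(count / n) for count in Counter(values).values())


def conditional_entropy(attr, cls):
    """H(C|A): entropy of the class within each attribute value, weighted by Pr(A = a_j)."""
    groups = {}
    for a, c in zip(attr, cls):
        groups.setdefault(a, []).append(c)
    return sum(len(group) / len(attr) * entropy(group) for group in groups.values())


def gain(attr, cls):
    """Information gain (2.1)."""
    return entropy(cls) - conditional_entropy(attr, cls)


def gain_ratio(attr, cls):
    """Information gain ratio (2.3); zero when the split information (2.2) is zero."""
    split_info = entropy(attr)
    return gain(attr, cls) / split_info if split_info > 0 else 0.0
```

For an attribute that separates two equally frequent classes perfectly with two values, both the gain and the split information equal one bit, so the ratio is 1.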

2.2 The basic concept of ITA 

ITA adopts an information gain ratio to select an appropriate abstraction and
controls generalization based on the selected abstraction. In principle, given
target classes, a grouping of tuples in a relational database, an abstraction
preserving the class distribution of the attribute values is preferred and
selected as an appropriate one. If some attribute values $a_1, \ldots, a_m$ of an attribute
$A = \{a_1, \ldots, a_\ell\}$ ($m \le \ell$) share an almost same or similar posterior class
distribution $(\Pr(C = c_1 | A = a_j), \ldots, \Pr(C = c_n | A = a_j))$, an "abstract class
distribution", defined as

$$(\Pr(C = c_1 | A \in \{a_1, \ldots, a_m\}), \ldots, \Pr(C = c_n | A \in \{a_1, \ldots, a_m\}))$$
$$= \Bigl(\sum_{j=1}^{m} \lambda_j \Pr(C = c_1 | A = a_j), \ldots, \sum_{j=1}^{m} \lambda_j \Pr(C = c_n | A = a_j)\Bigr)$$

also shows an almost same or similar class distribution, where $\lambda_j = \Pr(A = a_j) / \sum_{i=1}^{m} \Pr(A = a_i)$.
Thus, an abstraction identifying these $a_1, \ldots, a_m$ preserves
the necessary information about classes, so they can be abstracted to a single
abstract value. In other words, we consider only an abstraction that preserves
the class distribution as far as possible.

ITA uses the information gain ratio as one of the measures to define a
similarity between class distributions. The point is the change of the information
gain ratio when the attribute values $a_1, \ldots, a_m$ of A are abstracted to a single
value. An abstraction is considered as a mapping $f : A \to f(A)$, where $f(A)$ is
an abstract concept at the abstract level.

In general, according to the basic theorem of the entropy and the
data-processing theorem described in the literature [2], the following inequalities hold
between $H(C|A)$ and $H(C|f(A))$:

$$H(C|A) \le H(C|f(A)), \qquad I(C; A) \ge I(C; f(A)).$$

Hence the information gain decreases after the abstraction of attribute values.
The difference $e(f) = H(C|f(A)) - H(C|A)$ can show the similarity between
the distributions. That is, an abstraction $f$ that meets $e(f) \approx 0$ (i.e. $I(C; A) \approx
I(C; f(A))$) is a preferable mapping.

Furthermore, the split information $H(f(A))$ is used to choose one abstraction
from two or more preferable abstractions; it favors the abstraction that merges
the most attribute values into the corresponding abstract concept. For example,
assume two abstractions $f_1$ and $f_2$ meet
$I(C; A) \approx I(C; f_1(A)) = I(C; f_2(A))$. If $f_1$ replaces more attribute
values with the abstract concept than $f_2$, the split information satisfies
$H(f_1(A)) < H(f_2(A))$, and then the information gain ratio satisfies
$I(C; f_1(A)) / H(f_1(A)) > I(C; f_2(A)) / H(f_2(A))$. Thus ITA can select the
abstraction $f_1$. From these observations, our generalization method using the
information gain ratio can select an abstraction that preserves the class
distribution and replaces more attribute values with the abstract concept.
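
The two criteria above, the information loss $e(f)$ and the gain ratio, can be
illustrated with a minimal sketch. The toy rows, the mapping dictionaries and
all function names are assumptions made for this example only.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a list of symbols."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def conditional_entropy(rows):
    """H(C | A) for rows given as (attribute value, class label) pairs."""
    by_value = {}
    for a, c in rows:
        by_value.setdefault(a, []).append(c)
    n = len(rows)
    return sum(len(cs) / n * entropy(cs) for cs in by_value.values())


def info_gain(rows):
    """I(C; A) = H(C) - H(C | A)."""
    return entropy([c for _, c in rows]) - conditional_entropy(rows)


def gain_ratio(rows):
    """I(C; A) / H(A); defined as 0 when the attribute has a single value."""
    split_info = entropy([a for a, _ in rows])
    return info_gain(rows) / split_info if split_info > 0 else 0.0


def abstract(rows, mapping):
    """Apply an abstraction f, given as a dict from value to abstract value."""
    return [(mapping.get(a, a), c) for a, c in rows]


# Toy data: a1 and a2 share the same class distribution, a3 differs.
rows = ([("a1", "+")] * 3 + [("a1", "-")] + [("a2", "+")] * 3 +
        [("a2", "-")] + [("a3", "-")] * 4)
f1 = {"a1": "b", "a2": "b"}   # merges values with equal class distributions
f2 = {"a2": "c", "a3": "c"}   # merges values with different class distributions

loss_f1 = conditional_entropy(abstract(rows, f1)) - conditional_entropy(rows)
loss_f2 = conditional_entropy(abstract(rows, f2)) - conditional_entropy(rows)
ratio_before = gain_ratio(rows)
ratio_f1 = gain_ratio(abstract(rows, f1))
```

On this toy data `f1` loses essentially no information while lowering the
split information, so its gain ratio rises above the original one; `f2` has a
clearly positive loss.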



The generalization process in ITA is summarized as the following algorithm.
In the algorithm, the term "change ratio" means the ratio of the information
gain ratio after applying generalization to the one before applying
generalization. An abstraction in abstraction hierarchies is defined as a
possible grouping
$\{\{a_1^1, \dots, a_{n_1}^1\}, \dots, \{a_1^m, \dots, a_{n_m}^m\}\}$ of
attribute values $\{a_1^1, \dots, a_{n_1}^1, \dots, a_1^m, \dots, a_{n_m}^m\}$
in a relational database. An abstraction for attribute values means to replace
attribute values in the grouping with the corresponding abstract concept
$\{\tilde{a}_1, \dots, \tilde{a}_m\}$.



Algorithm 2.1 (information theoretical abstraction) 
Input : (1) a relational database, 

(2) target classes (i.e. target attribute values), 

(3) abstraction hierarchies and 

(4) a threshold value of the change ratio. 
Output : a generalized database. 



1. By some relational operations (e.g. projection and selection), extract a data
set that is relevant to the target classes.

2. Select an attribute from the database before applying generalization and 
compute the information gain ratio for the attribute. 

3. Compute the information gain ratio for each abstraction in the hierarchies
for the attribute and select an abstraction with the maximum information
gain ratio.

4. Compute the change ratio. If it is above the threshold value, substitute
abstract values in the abstraction for the attribute values in the database.

5. Merge overlapping tuples into one, count the number of merged tuples, and 
then the special attribute vote is added to each tuple in order to record how 
many tuples in the original database are merged into one abstract tuple as 
the result. 

6. Repeat the above four steps for all attributes.
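
Steps 2 to 5 of Algorithm 2.1 for a single attribute can be sketched as
follows. This is a simplified illustration under stated assumptions: tuples are
Python dictionaries, each candidate abstraction is a dictionary from attribute
value to abstract value, and the attribute names and data are invented.

```python
import math
from collections import Counter


def entropy(values):
    counts = Counter(values)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def gain_ratio(rows, attr, target):
    """Information gain ratio of `attr` with respect to the class `target`."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    n = len(rows)
    h_c_given_a = sum(len(cs) / n * entropy(cs) for cs in groups.values())
    gain = entropy([r[target] for r in rows]) - h_c_given_a
    split_info = entropy([r[attr] for r in rows])
    return gain / split_info if split_info > 0 else 0.0


def generalize_attribute(rows, attr, target, abstractions, threshold):
    """Steps 2-5 for one attribute; returns merged tuples with a 'vote' count."""
    def apply(mapping):
        return [{**r, attr: mapping.get(r[attr], r[attr])} for r in rows]

    before = gain_ratio(rows, attr, target)                        # step 2
    best = max(abstractions,
               key=lambda m: gain_ratio(apply(m), attr, target))   # step 3
    candidate = apply(best)
    after = gain_ratio(candidate, attr, target)
    change_ratio = after / before if before > 0 else 0.0           # step 4
    result = candidate if change_ratio > threshold else rows
    # Step 5: merge identical tuples and record how many were merged.
    votes = Counter(tuple(sorted(r.items())) for r in result)
    return [dict(items, vote=count) for items, count in votes.items()]


# Invented data: two similar occupations and one dissimilar one.
rows = ([{"job": "teacher", "salary": "<=50K"}] * 3 +
        [{"job": "teacher", "salary": ">50K"}] +
        [{"job": "professor", "salary": "<=50K"}] * 3 +
        [{"job": "professor", "salary": ">50K"}] +
        [{"job": "manager", "salary": ">50K"}] * 4)
abstractions = [
    {"teacher": "educator", "professor": "educator"},
    {"professor": "senior", "manager": "senior"},
]
generalized = generalize_attribute(rows, "job", "salary", abstractions, 1.0)
```

Here the first grouping is chosen because it keeps the class distribution while
reducing the split information, and the twelve input tuples collapse into three
abstract tuples whose votes sum to twelve.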
3 Iterative ITA

Since ITA performs just one application of abstraction for the whole database,
ITA does not generally guarantee that the class distribution is preserved in
each attribute selection step in C4.5. That is, when the classification
accuracy of the decision tree derived from the generalized database is required
to be almost equal to that of the original one, ITA cannot fully meet this
requirement.

For this reason, we propose iterative ITA in order to keep the class
distribution in each attribute selection step as far as possible. Iterative ITA
is given as the following algorithm.

Algorithm 3.1 (Iterative ITA) 

Input : (1) a relational database D, 

(2) target classes $c_k$ ($1 \le k \le n$),

(3) abstraction hierarchies AHs and 

(4) a threshold value of the change ratio T. 

Output : an abstract decision tree W ■ 

(In the following, l, m, n, p are constants and x, y, z are any numbers.)

1. Compute the information gain ratios for all attributes in D and select an
attribute $A_x$ according to the maximum information gain ratio.

2. Compute the information gain ratios for possible abstractions $f_i$
($1 \le i \le l$) in AHs for $A_x$ and select an abstraction $f_y$ if its
information gain ratio is the highest and its change ratio is higher than T.

3. If $f_y$ is selected at step 2 :

(i) Select all tuples that contain attribute values corresponding to an
abstract concept $\tilde{a}_j$ in $f_y(A_x)$ ($1 \le j \le m$).

(ii) Divide D into sub-databases $D_j$ according to $\tilde{a}_j$.

Otherwise :

(i) Select all tuples that contain an attribute value $v_j$ ($1 \le j \le p$).

(ii) Divide D into sub-databases $D_j$ according to $v_j$.

4. Compute a classification error rate ER for D and classification error rates
$ER_j$ for $D_j$.

If any error rate in $ER_j$ is larger than ER :

(i) Create child nodes corresponding to $D_j$ that branch from the current node
corresponding to D.

(ii) For each $D_j$, regard $D_j$ as D and perform steps 1-4 for the new D
repeatedly.

Otherwise :

Assign a class label to the current node corresponding to D.

Algorithm 3.1 has the following features. 

- The process of constructing the decision tree in C4.5 divides the current node
corresponding to D according to all attribute values
$\{a_1^1, \dots, a_{n_1}^1, \dots, a_1^m, \dots, a_{n_m}^m\}$ in $A_x$. The
number of branches in the original decision tree is $n_1 + \dots + n_m$. On the
other hand, iterative ITA divides the current node according to all abstract
concepts in the grouping
$\{\{a_1^1, \dots, a_{n_1}^1\}, \dots, \{a_1^m, \dots, a_{n_m}^m\}\} = \{\tilde{a}_1, \dots, \tilde{a}_m\} = f_y(A_x)$
selected in each attribute selection step. The number of branches in the
abstract decision tree is $m$. The inequality $n_1 + \dots + n_m \ge m$ holds
between $n_1 + \dots + n_m$ and $m$. Therefore iterative ITA reduces the
branches of the original decision tree constructed by C4.5.

- Roughly speaking, the condition for terminating the growth of the decision
tree in step 4 concerns whether the posterior class distributions in the child
nodes show distinguishably higher probabilities for particular classes.
Suppose that such a deviation of the class distribution in some child node is
larger than in the current node. This means that the classification accuracy
in $D_j$ is improved. In such a case, iterative ITA continues constructing the
decision tree. Otherwise it terminates the expansion process. Since the class
distribution at the abstract level is the average of those at the concrete
level, the deviation at the abstract level turns out to be smaller than the
deviation at the concrete level. In other words, a chance of improving the
classification accuracy may be lost by applying abstraction.

Furthermore, suppose that none of the posterior class distributions of the
child nodes at the concrete level shows a distinguished deviation. This means
that the condition for terminating the expansion process holds. Then, for any
abstraction, no posterior class distribution at the abstract level shows a
distinguished deviation either. This is again because the abstract distribution
is defined as the average of the concrete-level distributions. As a result, we
can say that the stopping condition for expansion at the abstract level is
invoked whenever it is invoked at the concrete level.

As a result, the precision of the abstract decision tree will become much
closer to that of C4.5, while keeping the same property of reducing the size of
the decision tree as non-iterative ITA.
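
The averaging argument used in the second feature can be checked numerically.
The sketch below assumes, purely for illustration, that the "deviation" of a
class distribution is measured by its largest class probability; the paper does
not fix a specific measure, and the numbers are invented.

```python
def mix(distributions, weights):
    """Weighted average of class distributions (the abstract distribution)."""
    total = sum(weights)
    n_classes = len(distributions[0])
    return [
        sum(w / total * d[i] for d, w in zip(distributions, weights))
        for i in range(n_classes)
    ]


# Three concrete child nodes and the probability mass of each attribute value.
concrete = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
weights = [0.3, 0.3, 0.4]

abstract_dist = mix(concrete, weights)
abstract_peak = max(abstract_dist)
concrete_peak = max(max(d) for d in concrete)
```

Because every component of the average is bounded by the largest corresponding
concrete value, `abstract_peak` can never exceed `concrete_peak`, which is why
abstraction can only flatten a class distribution under this measure.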



4 Experiment on Census Database 

We have carried out some experiments using our iterative ITA system, which has
been improved based on the proposed method and implemented in Visual C++ on a
PC/AT.

In our experiments, we try to derive decision trees from the Census Database
of the US Census Bureau found in the UCI repository [15]. The census database
used as training data in our experiment consists of 30162 tuples, each of which
has values for 11 attributes including age, workclass, education, occupation,
relationship, race, sex, hours-per-week, native-country and salary. Apart from
this training data, we prepare a smaller database (called test data) consisting
of 15060 tuples in order to check the classification accuracy of a derived
decision tree. The abstraction hierarchies for attribute values in the census
database are constructed based on the machine-readable dictionary WordNet
[13, 14] and are given to our system.
Fig. 4.2. Error rate of decision trees



Let us assume that the target classes are "≤ $50K" (NOT more than $50000) and
"> $50K" (more than $50000) in the attribute salary. Iterative ITA generalizes
the census database based on the information gain ratio. We adjust the
threshold of the change ratio to construct various abstract decision trees
using iterative ITA and compare these abstract decision trees with the
original decision tree constructed by C4.5.

The sizes and the error rates of the decision trees are shown in Figure 4.1
and Figure 4.2 respectively. The size of each abstract decision tree
constructed by iterative ITA is smaller than the original one, because
iterative ITA reduces the branches of the original decision tree and its depth,
as mentioned in Section 3. The graph in Figure 4.1 becomes nearly flat when the
threshold of the change ratio is more than 2.0, because ITA could not select an
appropriate abstraction.

Figure 4.2(a) shows that the difference between the error rate of the abstract
decision tree and that of the original one is about 0.03 when the change ratio
is more than 1.0. C4.5 classifies the training data better than iterative ITA.
This can be understood in the following manner. Each node in the abstract
decision tree keeps the class distribution before generalization as much as
possible. However, preserving the class distribution in each attribute
selection step introduces a small error. On the other hand, the error rate of
the abstract decision tree for the test data, shown in Figure 4.2(b), is nearly
equal to that for the training data. Therefore the original decision tree is
too specific to the training data.

Furthermore, when the threshold of the change ratio is about 1.0, the result
of constructing the decision tree is shown in Table 4.1. The abstract decision
tree constructed by iterative ITA is the most compact and its error rate for
the test data is the best of the three.



Table 4.1. Decision trees

                  Size    Error rate
                          Training data    Test data
C4.5              6048    0.131            0.189
ITA               2562    0.140            0.180
Iterative ITA     1533    0.161            0.177



From these observations, we consider that iterative ITA is useful for
constructing a compact decision tree whose error rate is approximately equal to
that before generalization, because the size decreases drastically at the cost
of a slight increase in the error rate for the training data.
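
The trade-off stated above can be made concrete from the figures in Table 4.1.
The sketch assumes the three columns are tree size, training error rate and
test error rate, in that order; the variable names are illustrative.

```python
# (size, training error rate, test error rate), as read from Table 4.1
table = {
    "C4.5": (6048, 0.131, 0.189),
    "ITA": (2562, 0.140, 0.180),
    "Iterative ITA": (1533, 0.161, 0.177),
}

base_size, base_train, base_test = table["C4.5"]
summary = {}
for name, (size, train_err, test_err) in table.items():
    summary[name] = {
        # fraction of the C4.5 tree size that remains
        "size_ratio": round(size / base_size, 3),
        # change in error rate relative to C4.5 (positive means worse)
        "train_delta": round(train_err - base_train, 3),
        "test_delta": round(test_err - base_test, 3),
    }
```

Under this reading, the iterative ITA tree is roughly a quarter of the size of
the C4.5 tree, its training error is higher by 0.030, and its test error is
lower by 0.012.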



5 Conclusions and Future Work 

In our previous work [7], we used ITA as the pre-processing for C4.5, which
constructs the decision tree. However, assuming that the user requests that the
classification accuracy of the abstract decision tree be almost equal to that
of the original one, just one application of abstraction for the whole database
cannot meet that request. For this reason, we have proposed iterative ITA,
which performs our generalization process in each attribute selection step in
C4.5. We consider that iterative ITA is useful for constructing a compact
abstract decision tree that is more understandable for most users and whose
increase in error rate is minimized among the given classes of abstractions.
That is, applying abstraction to the process of constructing the decision tree
is important for making the resulting tree easy to interpret. Furthermore, if
we apply any pruning technique to the abstract decision tree, the tree will
become still more compact, and its understandability will improve further.

Normal ITA generalizes the database in the pre-processing for constructing
the decision tree. On the other hand, iterative ITA generalizes the
sub-database selected by each node in the decision tree while C4.5 is
constructing the decision tree. Ideally, since each node in the abstract
decision tree constructed by iterative ITA keeps the class distribution before
generalization, we can extract the structure of the original decision tree from
the structure of the abstract decision tree. Therefore we consider that
iterative ITA can be used as the post-processing for the decision tree
(e.g. visualization).

The abstraction hierarchies used in iterative ITA are manually constructed
according to WordNet. It is a hard task for users and system administrators to
construct abstraction hierarchies for a very large database. Therefore, in
future work, we have to develop a method that automatically constructs
abstraction hierarchies using a machine-readable dictionary, e.g. WordNet. At
the moment, we have already developed a first version of the method.
Abstract. In almost every area of human activity, the formation of huge databases
has created a massive demand for new tools to transform data into task-oriented
knowledge. Our work concentrates on real-world problems, where the learner has to
handle data sets containing large amounts of irrelevant information. Our objective
is to improve the way large data sets are processed. In fact, irrelevant
information perturbs the knowledge data discovery process. That is why we look for
efficient methods to automatically analyze huge data sets and extract relevant
features and examples. This paper presents a heuristic algorithm dedicated to
example selection. In order to illustrate the capabilities of our algorithm, we
present results of its application to an artificial data set, and the way it has
been used to determine the best human resource allocation in a factory scheduling
problem. Our experiments have indicated many advantages of the proposed
methodology.



1 Introduction 

The objective of our research is to allow a better use of large amounts of data, so we have
focused our interest on the Knowledge Data Discovery (KDD) process. More precisely, our work
addresses the problem of concept learning from examples. In our study, we concentrate on data
filtering, which is an essential step for the Data Mining phase [1]. Our goal is to draw out the
underlying structures of the data to be processed through the successive KDD steps. From input
data, coded as attribute-value data, we obtain a set of examples (step ①). During the next step,
considering our population, relevant data have to be selected in order to avoid a wrong model
representation of the concept to be learned (step ②). This information is then structured
(step ③) and knowledge is generated in the form of rules, decision trees or decision structures
[2]. In our experiments, we generate decision trees using J.R. Quinlan's well-known C4.5
algorithm [3].

Data → ① Extraction → Examples → ② Filtering → ③ Structuration → Knowledge
Data Filtering

An excess of information can hide useful information. That is why we try to filter noisy data by
selecting only data that explain the goal variable (class). Data selection consists in choosing
a data set (variables and examples) as small as possible but necessary and sufficient to describe
the target concept. The data dimension can be reduced by selecting relevant variables or by
selecting relevant examples. Most works emphasize the first approach, as in [4], [5], [6], [7],
[8], [9], [10], [11], [12], [13]. Still, we decided to stress the example selection problem: just
as some attributes are more useful than others, so may some examples better aid the learning
process than others [14]. Researchers [4], [14], [15] have pointed out three main reasons for
selecting relevant examples during the induction process. First, if we have lots of unlabeled
examples, or if they are easy to generate, the cost of labeling is usually high (labels must be
obtained from experts). Second, the learning algorithm can be computationally intensive; in this
case, for purposes of computational efficiency, it is worth learning only from some examples if
sufficient training data is available. A third reason for example selection is to increase the
rate of learning by focusing attention on informative examples. Liu and Motoda [16] compare
different works dealing with data selection and point out four major points: the type of search,
the organization of search, the selection criterion and the stopping criterion. Michaut [17]
gives a summary table of the filtering algorithms, sorting them according to these four points
and the induction methods used after the selection algorithms. In this paper we propose a
heuristic algorithm to reduce the data dimension using a sequential forward generation search
strategy. In other words, we create a kernel of examples N starting with an empty set. The
process is stopped when the obtained kernel N is equivalent to the starting example set
$\Omega$. In Section 2 our problem is formalized using Diday's symbolism [18], [19]; then
Section 3 presents the criteria used in our algorithm, that is to say the degree of generality
concept, the discriminant power between two objects and the discriminant power between an object
and a set of objects. From these definitions we propose in Section 4 a new algorithm to select
relevant objects. Then, Section 5 describes two examples illustrating the advantages of the
proposed methodology.



2 Knowledge Representation 

"From the elementary objects characterized by the values taken by several variables, several 
symbolic objects can be defined" Kodratoff, Diday [20].

The Symbolic Objects 

A symbolic object is defined as a conjunction of properties on the values taken by the variables.
$\Omega$ is the studied population, here the sample of $n$ observed examples (instances)
$\Omega = \{\omega_1, \omega_2, \dots, \omega_n\}$, and $Y$ is the set of $r$ qualitative
variables $Y = \{y_1, \dots, y_k, \dots, y_r\}$ defined over $\Omega$.

Let $y_k$ be defined as $y_k : \Omega \to O_k$, $\omega \mapsto y_k(\omega) = m_v^k$,

where $O_k$ is the space of modalities containing a finite set of $y_k$ observed values:

$O_k = \{m_1^k, \dots, m_v^k, \dots\}$, where $m_v^k$ is the modality $v$ of the variable $y_k$,

with $O = O_1 \times O_2 \times \dots \times O_k \times \dots \times O_r$ the set of all possible elementary objects,

and

$Y : \Omega \to O$, $\omega \mapsto Y(\omega) = (y_1(\omega), \dots, y_r(\omega))$, where $Y(\omega)$ is the description of the object $\omega$. (2)
Elementary Events : An elementary event is denoted $e_k = [y_k = V_k]$, where $V_k \subseteq O_k$; it is a predicate
expressing that "variable $y_k$ takes its values $y_k(\omega)$ in $V_k$".

Each $\omega_i$ is characterized by a set of $r$ variables (attributes)

$y_1, \dots, y_k, \dots, y_r$: $\Omega \to O_k$, $\omega_i \mapsto y_k(\omega_i) = m_v^k$ (3)

For an elementary object $\omega_i$, $e_k$ is restricted to a single value $m_v^k$: $V_k = m_v^k$, $\forall k = 1, \dots, r$.
$O_k$ is a set of $m_k$ modalities (values) of the variable $y_k$,

where $m_v^k$ is the modality $v$ of the variable $y_k$ for the elementary object $\omega_i$, $i = 1, \dots, n$.

Let $O = O_1 \times O_2 \times \dots \times O_k \times \dots \times O_r$ be the workspace and $Y$ an application:

$Y : \Omega \to O$, $\omega \mapsto Y(\omega) = (y_1(\omega), \dots, y_r(\omega))$

Assertion Objects : We define an assertion object as a logical conjunction of elementary events:
$a = [y_1 = V_1] \wedge \dots \wedge [y_k = V_k] \wedge \dots \wedge [y_q = V_q]$ with $q \le r$. (4)

$a$ is a conjunction of elementary events simultaneously true for a given elementary object $\omega$.



3 Functional Modelisation 



The Notion of Relevance 

To each variable $y_k$ we associate a function $\varphi_k$ relative to each elementary object $\omega_i$:

$\varphi_k : \Omega \to \{0, 1\}$, $\omega_i \mapsto \varphi_k(\omega_i / m_v^k) = 1 \iff y_k(\omega_i) = m_v^k$, else $\varphi_k(\omega_i / m_v^k) = 0$ (5)

Let $Y$ be an application; we associate the function $\Phi_i$ relative to each assertion object $a$. We have:

$\Omega \to \{0, 1\}$, $\omega \mapsto \Phi_i(\omega / m^1 \dots m^k \dots m^r) = 1 \iff Y(\omega) = \{y_1(\omega) = m^1 \wedge \dots \wedge y_k(\omega) = m^k \wedge \dots \wedge y_r(\omega) = m^r\}$, else $\Phi_i(\omega / m^1 \dots m^k \dots m^r) = 0$ (6)

An assertion object describes a class and we look for the assertion objects corresponding to each
class. However, we should first give a definition of relevance, following Blum and Langley [4],
using our own formalism.



Strong relevance : A variable $y_k$ is relevant regarding a goal variable $y_{goal}$ and a population $\Omega$, if
two objects belonging to the space $\Omega$ having two different values of $y_{goal}$ only differ when
considering the value taken by the attribute $y_k$.

For $\omega_i$ and $\omega_j \in \Omega$, $\forall y_l,\ l = 1, \dots, r,\ l \ne k$: $y_l(\omega_i) = y_l(\omega_j)$ and $y_{goal}(\omega_i) \ne y_{goal}(\omega_j)$ (7)

We should stress the fact that this constraint is very strong. Consequently, we give another
definition to make a distinction between a non-pertinent variable and a pertinent but redundant
variable. For example, if two variables are redundant, they will not be considered relevant, even
if they are found pertinent when considered separately. Thus we define a weak relevance.

Weak relevance : A variable $y_k$ is weakly relevant regarding a goal variable $y_{goal}$ and a population
$\Omega$, if it is possible to remove a subset of variables so that $y_k$ becomes strongly relevant. A
probabilistic definition also exists, but as we do not use it, it is not presented in this paper. We
now introduce Boolean functions to measure relevance. Consider the following Boolean functions:

$(\omega_i, \omega_j) \mapsto \varphi_k^{ij} = \varphi_k(\omega_i, \omega_j) = 1 \iff y_k(\omega_i) \ne y_k(\omega_j)$, else $\varphi_k^{ij} = 0$ (8)

$(\omega_i, \omega_j) \mapsto \varphi_{goal}^{ij} = \varphi_{goal}(\omega_i, \omega_j) = 1 \iff y_{goal}(\omega_i) \ne y_{goal}(\omega_j)$, else $\varphi_{goal}^{ij} = 0$

We can now define a function relative to one variable (attribute) for a pair of objects. When a 
pair of objects is relevant considering the goal variable yg„ai, then this function takes the value 1. 

{(a,co j =1 



We define the first variable aggregation measure as the following sum $S$ =






TCOiCOJ 



(9) 



If S=1 then we only have one relevant attribute. Consequently this measure gives us the strong 
relevance when it takes the value 1 . 

The following example (Table 1) illustrates the notion of strong relevance : 



Table 1. Three objects $\{\omega_1, \omega_2, \omega_3\}$ with $y_2$ strongly relevant

              $y_1$   $y_2$   $y_3$   $y_4$   $y_{goal}$
$\omega_1$    2       1       1       3       1
$\omega_2$    2       2       1       3       2
$\omega_3$    2       4       1       3       2



The variable $y_2$ is strongly relevant regarding the goal variable $y_{goal}$ because the objects
considered only differ when considering the value of attribute $y_2$.
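
The check carried out on Table 1 can be sketched in a few lines. The function
name and the data layout are assumptions for illustration; the rule applied is
the strong-relevance condition stated above (a pair with different goal values
that differs on exactly one attribute).

```python
from itertools import combinations

# Table 1: attribute values (y1, y2, y3, y4) and the goal value of each object.
objects = {
    "w1": ((2, 1, 1, 3), 1),
    "w2": ((2, 2, 1, 3), 2),
    "w3": ((2, 4, 1, 3), 2),
}


def strongly_relevant_attributes(objects):
    """Return 0-based indices k such that some pair of objects with different
    goal values differs on attribute k only."""
    found = set()
    for (_, (xa, ga)), (_, (xb, gb)) in combinations(objects.items(), 2):
        if ga == gb:
            continue  # only pairs from different classes are informative
        differing = [k for k, (u, v) in enumerate(zip(xa, xb)) if u != v]
        if len(differing) == 1:
            found.add(differing[0])
    return found


relevant = strongly_relevant_attributes(objects)
```

For the data of Table 1 the only index returned is 1, i.e. the attribute
$y_2$, in agreement with the text.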



The degree of generality 

The degree of generality $g_k$ of a variable $y_k$ is defined as $g_k = \mathrm{card}(V_k) / \mathrm{card}(O_k)$, $\forall k = 1, \dots, r$.

Assertion Objects and Degree of Generality 

Let $a$ be an assertion object : $a = [y_1 = V_1] \wedge \dots \wedge [y_k = V_k] \wedge \dots \wedge [y_r = V_r]$
a is composed of a set of elementary objects, like for example {cOi CO7 cojlwith : 



( 10 ) 



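We have :

$g_k(a) = \dfrac{\mathrm{card}(V_k)}{\mathrm{card}(O_k)}$

$G(a) = \prod_{k=1}^{r} \dfrac{\mathrm{card}(V_k)}{\mathrm{card}(O_k)} = g_1(a) \times \dots \times g_k(a) \times \dots \times g_r(a)$ (11)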
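
As a small worked illustration of the product form of $G(a)$, the sketch below
computes the degree of generality of an assertion object from the sizes of its
value sets $V_k$ and of the modality spaces $O_k$. The dictionaries and their
contents are hypothetical.

```python
from math import prod


def degree_of_generality(assertion, domains):
    """G(a) = product over k of card(V_k) / card(O_k).

    assertion[k] : set V_k of values allowed for variable k
    domains[k]   : set O_k of all modalities of variable k
    """
    return prod(len(assertion[k]) / len(domains[k]) for k in domains)


# Hypothetical modality spaces for two variables.
domains = {"y1": {1, 2, 3, 4}, "y2": {1, 2}}

# An elementary object fixes a single value per variable.
elementary = {"y1": {2}, "y2": {1}}
# A more general assertion allows several values for y1 and any value for y2.
general = {"y1": {1, 2}, "y2": {1, 2}}

g_elementary = degree_of_generality(elementary, domains)
g_general = degree_of_generality(general, domains)
```

An elementary object has the smallest possible degree of generality for its
domains, and the value grows as the assertion admits more modalities.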



gt(.si)= 



card{Vi ) 
card(Ok) 



K=jium and , card{V2) _card{Viy(p!„ 
* ^ card{02) cardiO^) 



The evaluation of (p^.^ is trivial. If (p ^,^=0 then cOi is not useful to the intension (set of objects) of 
the elementary event associated with y^. In order to take into account all variables we perform an 
aggregation of the different functions. 

We obtain : $(ft)i,y)= 7 (12) 

*=i 



If $\Phi(\omega_i, S) = 0$, $\omega_i$ will not be part of the kernel representing a given class. For each modality of
the goal variable we want to define the assertion object. Let $Y_{goal}$ be the goal variable :
$Y_{goal} = [C_1, \dots, C_{r_{goal}}]$. $C_i$ is defined by the assertion object $a_i$, whose intension will be estimated by
the aggregation of all elements belonging to $\Omega$. We generalize the function defined for one class to
the whole population :

^,„al(0).,S)=^Z-'EK^*(PZ‘ ( 13 ) 

k=l 

if cOi has an intension already described by s so it will not be selected 

if =1 cOj and coi are neighbors and only differ for a single element. 

Let p be the corresponding index with 

We can aggregate $\omega_i$ and $\omega_j$ by creating an assertion object $a_m$ such as :

$a_m = [y_1(\omega_i)] \wedge \dots \wedge [y_k(\omega_i)] \wedge \dots \wedge [y_p(\omega_i) \vee y_p(\omega_j)]$ (14)

In other words, we have to select the indispensable elementary objects and use them to build the
kernel of objects we look for.



4 Example Selection Algorithm 



The algorithm we present is similar to simple greedy selection algorithms such as those proposed
by Almuallim and Dietterich [22] or Vignes [23]. It is described by the following criteria :

Search Strategy : heuristic, variables are selected in accordance with a criterion which is
maximized or minimized

Generation Scheme : sequential forward, the algorithm starts with an empty set

Evaluation Measure : relevance: discriminant power for the variables and generality degree for
the objects

Stopping Criterion : $ODP_{goal}(S) = ODP_{goal}(Y)$ (total discriminant power) for the variables
and $\Phi(N) = \Phi(\Omega)$ (total degree of generality) for the objects



Example Selection Algorithm 
① Initialization $N = \emptyset$
② Extra-class work

- Look for the most relevant objects 

- Selection of the strongly relevant pairs 



( ( 0 , 0 ) J such as 



JK,. 



then 

Yoxoj ^ 



N=N+{(o,(oJ 



③ Intra-class work

Stopping Criterion : for each class we evaluate the degree of generality of the class $\Phi(\Omega_i)$
Repeat : choice of the relevant example

Choice of ) such as 

Until $\Phi(N) = \Phi(\Omega_i)$



(Om,S 



lfc=l 

In the next section we present the experimental results of our filtering algorithm.
5 Experimental Results

Artificial Data Set 

We have used the C++ language to implement our algorithm and the pre-processing tasks which
prepare the data set to be processed with our method. These pre-processing tasks only sort the
data by class and write several output information files to help our algorithm work faster.
Note that this does not interfere in any way with the initial data set content. To
facilitate the understanding of our algorithm we consider an artificial data set containing 20 objects
(examples) and 4 variables (attributes). The set of objects (see Table 2) does not contain redundant
objects because their elimination is trivial. We use J.R. Quinlan's C4.5 algorithm [12], [3] to
generate the decision trees. We now apply our algorithm to the full data set, then generate
a new decision tree with the reduced set and compare it with the first one.



Table 2. Data set used to evaluate the object selection algorithm

[Table body not recoverable: 20 objects $\omega_1, \dots, \omega_{20}$, each described by $y_1$, $y_2$, $y_3$, $y_4$ and $y_{goal}$.]


During step ①, the algorithm evaluates the strong relevance pairs, and then constructs the kernel of
objects to take into consideration. After step ② the algorithm gives us the following results :

Strong relevance pairs : C = (cth 004) f CO2 cosj j (O3 coi^l f CO3 cOg} f (O3 (Oi2j f OO; (Oigj fcon cOisJ 

Kernel = (0)20)30)40)5 0)g (0,2 0,3 m,5 (0,6 CO, 3(020} 

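Step ③ completes the example selection :

Objects to be added to the kernel :

None for class 1 $\{\omega_2, \omega_3\}$
None for class 2 $\{\omega_4, \omega_5\}$
$\{\omega_7\}$ for class 3 $\{\omega_7, \omega_8\}$
None for class 4 $\{\omega_{12}, \omega_{13}, \omega_{15}\}$
$\{\omega_{20}\}$ for class 5 $\{\omega_{16}, \omega_{18}, \omega_{20}\}$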

At the end of the algorithm we obtain the final set of objects (Table 3). We should stress the fact 
that the obtained set is 40% smaller. 

Table 3. Reduced set of objects obtained after the algorithm selection : 40% smaller

[Table body not recoverable: the retained objects, each described by $y_1$, $y_2$, $y_3$, $y_4$ and $y_{goal}$.]



In order to evaluate our selection of relevant objects we have generated two trees with C4.5. The 
first decision tree has been generated using the original data set (Table 2) and the second decision 



A New Algorithm to Select Learning Examples from Learning Data 77 



tree using the reduced set of objects (Table 3). We observe that the decision trees obtained are 
exactly the same, which means that there is no information loss. As a consequence, if we try to 
classify the objects eliminated during the filtering phase performed by our algorithm using the 
decision tree, we obtain a good classification rate of 100%. Our algorithm has substantially 
reduced the number of objects necessary to construct a classifier (set 40% smaller) while 
maintaining the classification rate unchanged. 

A Human Resource Allocation Problem 

We have used our algorithm to solve a real-world human resource allocation problem. The case
studied is a factory that produces gold chains and chain bracelets. The aim is to obtain a scheduling
system capable of learning the human resource allocation and queue heuristics considering the
manufacturing orders for a given period. We evaluate the scheduling using the performance metrics
given by the scheduling software Preactor : cost, due dates, priority, idle time etc. By repeating the
process several times, we obtain a learning base of the best workshop configurations considering
the whole set of possible cases. Then, we eliminate all non-pertinent examples in order to construct
a learning system able to set up the software parameters automatically when a new set of
manufacturing orders arrives. We have implemented our method using Visual C++ and Preactor,
and the results obtained show a better and faster response of the considered manufacturing system.



Conclusion 

This paper described a filtering algorithm to select the significant examples of a learning set.
Our example selection algorithm has shown its efficiency in determining a subset of relevant objects
from a learning set. As a matter of fact, the classifier obtained using the reduced learning data set
offers the same precision as the one induced from the complete learning data set. We also used
our algorithm on huge data sets and the results obtained are far better than those obtained by,
for example, the windowing method used by C4.5. Our methodology allows the reduction of a
learning set without information loss and it also gives a smaller data set on which to perform the
induction task. The computational time of algorithms is sometimes important, and the fewer examples
an algorithm processes, the faster it performs the induction process. We are improving our algorithm
in order to perform a combined example/attribute selection; this software should be a useful tool for
the KDD community.
Abstract. In applications where preferences are sought it is desirable to order
instances of important phenomena rather than classify them. Here we consider
the problem of learning how to order instances based on spatial partitioning. We
seek statements to the effect that one instance should be ranked ahead of
another.

A two-stage approach to data ranking is proposed in this paper. The first stage
learns to partition the spatial areas, recursively using the largest irregular
area to represent data of the same rank, and obtains a series of spatial areas
(rules). The second stage learns a binary preference function indicating whether
it is advisable to rank one instance before another according to the rules
obtained from the first stage. The proposed method is evaluated using a
real-world stock market data set. The results from initial experiments are quite
remarkable and the testing accuracy is up to 71.15%.



1 Introduction 

Work in inductive learning has mostly concentrated on learning to classify. However, 
there are many applications in which it is desirable to order instances rather than 
classify. An example is the personalised stock selecting system that prioritises 
interesting stocks and states that one stock should be ranked ahead of another. Such 
preferences are often constructed based on a learned probabilistic classifier or on a 
regression model. 

Ordering and ranking have been investigated in various fields such as decision 
theory, social sciences, information retrieval and mathematical economics. For 
instance, it is common practice in information retrieval to rank documents according 
to their probability of relevance to a query, as estimated by a learned classifier for the 
concept “relevant document”. 

Cohen, et al [1] propose a ranking method that builds a preference function from a 
set of “ranking experts”, and the preference function is then used to construct an 
ordering function that agrees best with the experts. The method used by Bartell, et al 
[2] adjusts system parameters to maximise the match between the system’s document 
ordering and the user’s desired ordering, given by relevance feedback from experts. 
Bartell [3] proposes a method by which the relevance estimates by different experts 
can be automatically combined to result in superior retrieval performance. All these 
methods need a set of “ranking-experts”, and can’t rank data via analysing the data 
themselves. 

Wang, et al [4] propose a novel method of data reduction to utilise the partial order 
of data viewed as an algebraic lattice. Data and knowledge are represented uniformly 



K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 78-84, 2000. 
© Springer-Verlag Berlin Heidelberg 2000 




Data Ranking Based on Spatial Partitioning 79 



by a logical structure called a hyper relation, which is a generalisation of a database 
relation (2-d table). However, its data reductive operation is based on regular areas in 
the data space, so its expressive ability is limited. 

The method proposed in this paper is an attempt to rank data automatically based 
on data-spatial partitioning, using the biggest irregular area to represent the same rank 
data, and to identify the distributed areas of the same rank data via learning from a 
training data set. The advantage of this method is that it is simple and automatic. It can 
make judgements automatically by analysing data as they change, adapting to the demands 
of dynamic scenarios (exemplified here by stock markets), and it doesn’t need the 
intervention of “ranking-experts”. 

The remainder of the paper is organised as follows. Section 2 introduces the 
definitions and notation. Section 3 describes the data ranking algorithm based on 
spatial partitioning, in which the execution process of an example of the algorithm is 
demonstrated by graphical illustration. The experimental results are described and the 
evaluation is shown in Section 4. Section 5 ends the paper with a discussion and 
future work. 

2 Definitions and notation 

Given two points $p_i$, $p_j$ and two spatial areas $a_i$, $a_j$ in multidimensional space, we 
represent $p_i = (p_{i1}, p_{i2}, \ldots, p_{in})$, $p_j = (p_{j1}, p_{j2}, \ldots, p_{jn})$, and 
$a_i = ([t_{i11}, t_{i12}], [t_{i21}, t_{i22}], \ldots, [t_{in1}, t_{in2}])$, 
$a_j = ([t_{j11}, t_{j12}], [t_{j21}, t_{j22}], \ldots, [t_{jn1}, t_{jn2}])$, in which $[t_{il1}, t_{il2}]$ is the 
projection of $a_i$ to its $l$-th component, and $t_{il2} \ge t_{il1}$, $l = 1, 2, \ldots, n$. In this paper, 
for simplicity and uniformity, any point $p_i$ is represented as a spatial area in multidimensional 
space, viz. $p_i = ([p_{i1}, p_{i1}], [p_{i2}, p_{i2}], \ldots, [p_{in}, p_{in}])$. This is often a more convenient 
and uniform representation for analysis. 

Definition 1 Given two areas $a_i$, $a_j$ in multidimensional space, the merging operation 
of two areas, denoted by ‘$\cup$’, can be defined as 
$a_i \cup a_j = ([\min(t_{i11}, t_{j11}), \max(t_{i12}, t_{j12})], [\min(t_{i21}, t_{j21}), \max(t_{i22}, t_{j22})], \ldots, [\min(t_{in1}, t_{jn1}), \max(t_{in2}, t_{jn2})])$. 

The intersection operation ‘$\cap$’ of two areas in multidimensional space can be defined 
as $a_i \cap a_j = ([\max(t_{i11}, t_{j11}), \min(t_{i12}, t_{j12})], [\max(t_{i21}, t_{j21}), \min(t_{i22}, t_{j22})], \ldots, [\max(t_{in1}, t_{jn1}), \min(t_{in2}, t_{jn2})])$. 
$a_i \cap a_j$ is empty only if there exists an $l$ such that $\max(t_{il1}, t_{jl1}) > \min(t_{il2}, t_{jl2})$, 
where $l = 1, 2, \ldots, n$. 

Merging (or intersecting) a point with an area can be regarded as a special case 
of the above definitions. 
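The merging and intersection operations can be sketched in a few lines of Python. This is only an illustration of Definition 1 under the interval representation given above; the names `Area`, `point_as_area`, `merge` and `intersect` are hypothetical and do not come from the paper.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]
Area = List[Interval]  # one closed interval [t_l1, t_l2] per dimension


def point_as_area(p: List[float]) -> Area:
    """Represent a point as the degenerate area [p_l, p_l] in every dimension."""
    return [(x, x) for x in p]


def merge(a: Area, b: Area) -> Area:
    """Merging operation: per dimension, take the min lower and max upper bound."""
    return [(min(a1, b1), max(a2, b2)) for (a1, a2), (b1, b2) in zip(a, b)]


def intersect(a: Area, b: Area) -> Optional[Area]:
    """Intersection; returns None when some dimension has max(lower) > min(upper)."""
    out = []
    for (a1, a2), (b1, b2) in zip(a, b):
        lo, hi = max(a1, b1), min(a2, b2)
        if lo > hi:
            return None  # empty intersection
        out.append((lo, hi))
    return out
```

Merging a point with an area is then just `merge(point_as_area(p), a)`, which matches the remark that a point is a special case of an area.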

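Definition 2 Given an area $a_j$ in multidimensional space, denoted as 
$a_j = ([t_{j11}, t_{j12}], [t_{j21}, t_{j22}], \ldots, [t_{jn1}, t_{jn2}])$, the complementary operation of $a_j$ is defined as 
$\overline{a_j} = (\overline{[t_{j11}, t_{j12}]}, [t_{j21}, t_{j22}], \ldots, [t_{jn1}, t_{jn2}]) \cup ([t_{j11}, t_{j12}], \overline{[t_{j21}, t_{j22}]}, \ldots, [t_{jn1}, t_{jn2}]) \cup \ldots \cup ([t_{j11}, t_{j12}], [t_{j21}, t_{j22}], \ldots, \overline{[t_{jn1}, t_{jn2}]}) \cup \ldots \cup (\overline{[t_{j11}, t_{j12}]}, \overline{[t_{j21}, t_{j22}]}, \ldots, \overline{[t_{jn1}, t_{jn2}]})$, 
where an overline on an interval denotes its complement along that dimension; it is the area in the 
multidimensional space complementary to the area $a_j$. 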

Definition 3 Given a point $p_i$ denoted as $p_i = (p_{i1}, p_{i2}, \ldots, p_{in})$ and an area $a_j$ denoted as 
$a_j = ([t_{j11}, t_{j12}], [t_{j21}, t_{j22}], \ldots, [t_{jn1}, t_{jn2}])$ in multidimensional space, the universal hyper 
relation ‘$\le$’ is defined as $p_i \le a_j$ (called $p_i$ falling into the area $a_j$), if for all $l$, 
$t_{jl1} \le p_{il} \le t_{jl2}$, where $l = 1, 2, \ldots, n$. 

Definition 4 Let $D$ be a finite set of instances. A preference function Pref is a binary 
function Pref: $D \times D \to [0, 1]$. For $u, v \in D$, Pref(u, v) = 1 (respectively 0) is interpreted as 
a strong recommendation that $u$ should be ranked above (respectively, below) $v$. A 
special value, denoted $\perp$, is interpreted as an abstention from making a 
recommendation. The results of data-spatial partitioning will be a series of spatial 
partition areas (rules) which represent the different rank data, and new instances will 
be ranked according to the spatial areas they fall into. 

For simplicity, all the data attributes used in this paper for data ranking are 
numerical. For categorical or binary data, the set union operation (respectively the 
intersection and complementation operations) can be used in place of the ‘$\cup$’ operation 
(respectively the ‘$\cap$’ operation and the complementary operation) defined above. 

3 Ranking algorithm 

Let a training data set be $D = \{d_1, d_2, \ldots, d_{|D|}\}$, where $d_i = (d_{i1}, d_{i2}, \ldots, d_{in})$. $d_i$ is a point 
in multidimensional space which can be represented as $d_i = ([d_{i1}, d_{i1}], [d_{i2}, d_{i2}], \ldots, [d_{in}, d_{in}])$ 
using spatial area representation, where $d_{in}$ is a continuous valued decision 
attribute. The data in the training data set $D$ are sorted in decreasing order by decision 
value at the beginning of spatial partitioning. The sorted result is denoted as 
$D_0 = \{d_1^0, d_2^0, \ldots, d_{|D|}^0\}$. The sorted data set is divided into $k$ equal parts from top to 
bottom; $k$ is a parameter used to tune the algorithm towards an optimal division. The $i$-th part 
$q_i$ has $|q_i|$ data, and $q_i = \{d_j^0 \in D_0 \mid j = \lfloor ((i-1) \cdot n)/k \rfloor + 1, \ldots, \lfloor (i \cdot n)/k \rfloor\}$, $i = 1, 2, \ldots, k$, 
where $n$ here is the number of data being divided. 

Sorting and dividing should be done before the data space is partitioned. The 
partitions are continuous spaces which might order (see Figure 2.1). 

The spatial partitioning algorithm is as follows. 

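1. $t = 0$; set $\varepsilon$ to be a constant. 

2. $n = |D_t|$; $M_i^t = \bigcup \{ d_j^t \mid d_j^t \text{ in the } i\text{-th part of } D_t \}$, $i = 1, 2, \ldots, k$. 

3. $M_{i,j}^t = M_i^t \cap M_j^t$, $i \ne j$ and $i, j = 1, 2, \ldots, k$. 

4. $S_i^t = M_i^t \cap \overline{M_{i,1}^t} \cap \ldots \cap \overline{M_{i,k}^t}$ (taking every $M_{i,j}^t$ with $j \ne i$), $i = 1, 2, \ldots, k$. 

5. $D_{t+1} = \{ d_l^t \mid d_l^t \in M_{i,j}^t,\ i \ne j \text{ and } i, j = 1, 2, \ldots, k \}$. 

6. If $|D_{t+1}| \le \varepsilon$, go to 8. 

7. $t = t + 1$, go to 2. 

8. $R_i = \{S_i^0, S_i^1, \ldots, S_i^t\}$, $i = 1, 2, \ldots, k$. 

The symbols above are: $S_i^t$, the biggest irregular area of the $i$-th part obtained in the $t$-th 
recurrence; $D_t$, the training data set used in the $t$-th recurrence; $M_i^t$, the merging area of the 
$i$-th part data obtained in the $t$-th recurrence; and $M_{i,j}^t$, the intersection of $M_i^t$ and $M_j^t$ 
obtained in the $t$-th recurrence. 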
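The eight steps can be sketched in Python as follows. This is a minimal illustration under several assumptions that the paper does not state: each point is a tuple whose last entry is the decision value, the merging areas are built over the condition attributes only, each irregular area $S_i^t$ is stored as the pair (merging area, overlaps to exclude), and the loop also stops when an iteration removes no data, which guards against non-termination. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, ...]        # condition attributes..., decision value (last)
Box = List[Tuple[float, float]]  # one closed interval per condition attribute
Region = Tuple[Box, List[Box]]   # (M_i, overlaps M_ij cut out of it) stands for S_i


def bounding_box(points: List[Point]) -> Box:
    """Merge all points of one part over the condition attributes (step 2)."""
    dims = range(len(points[0]) - 1)
    return [(min(p[l] for p in points), max(p[l] for p in points)) for l in dims]


def box_intersection(a: Box, b: Box) -> Optional[Box]:
    """Intersection of two areas, or None if it is empty (step 3)."""
    out = []
    for (a1, a2), (b1, b2) in zip(a, b):
        lo, hi = max(a1, b1), min(a2, b2)
        if lo > hi:
            return None
        out.append((lo, hi))
    return out


def inside(p: Point, box: Box) -> bool:
    """Universal hyper relation: p falls into box (zip ignores the decision value)."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(p, box))


def partition(data: List[Point], k: int, eps: int) -> List[List[Region]]:
    rules: List[List[Region]] = [[] for _ in range(k)]  # R_1 .. R_k
    current = list(data)
    while current:
        current.sort(key=lambda p: p[-1], reverse=True)  # decreasing decision value
        n = len(current)
        parts = [current[(i * n) // k:((i + 1) * n) // k] for i in range(k)]
        boxes = [bounding_box(q) if q else None for q in parts]
        overlaps = {}
        for i in range(k):
            for j in range(i + 1, k):
                if boxes[i] is not None and boxes[j] is not None:
                    m_ij = box_intersection(boxes[i], boxes[j])
                    if m_ij is not None:
                        overlaps[(i, j)] = m_ij
        for i in range(k):  # step 4: S_i is M_i minus every overlap involving i
            if boxes[i] is not None:
                excluded = [m for (a, b), m in overlaps.items() if i in (a, b)]
                rules[i].append((boxes[i], excluded))
        # step 5: data lying in any overlap form the next training set
        nxt = [p for p in current if any(inside(p, m) for m in overlaps.values())]
        if len(nxt) <= eps or len(nxt) == len(current):  # step 6 plus a safety stop
            break
        current = nxt
    return rules
```

Storing each $S_i^t$ as a bounding area plus the overlaps to exclude avoids materialising the complement explicitly; membership is tested as "inside the area and outside every excluded overlap".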

It is important to pre-process the training data set. After sorting and dividing the 
data in the training data set, the decision values of data in the upper parts are greater than 
those in the lower parts. The partition areas, denoted as $R_i$, $i = 1, 2, \ldots, k$, are 
obtained eventually by running the above algorithm; the function Pref(u, v) of $u$, 
$v$ in the testing data set can then be calculated easily based on the definition of the preference 
function. 

Given a testing data set and the areas $S_i^l \in R_i$ obtained from the partitioning process, 
where $i = 1, 2, \ldots, k$: for any testing data $u$, $v$ in the testing data set, Pref(u, v) is calculated 
using the universal hyper relation $\le$, by judging which areas $u$ and $v$ belong to and ranking 
accordingly. 

The algorithm is as follows. 

• If $u \le S_i^l$ and $v \le S_j^l$, where $j > i$ (respectively $j < i$) and $l = 1, 2, \ldots, t$, then Pref(u, v) 
= 1, i.e. $u$ is preferred to $v$ (respectively Pref(u, v) = 0, i.e. $v$ is preferred to $u$). 

• If $u, v \le S_i^l$, viz. $u$ and $v$ fall into the same area $S_i^l$, then Pref(u, v) = $\perp$, viz. $u$, $v$ 
have no preference relationship. 

• If $u \le S_i^l$ and $v \le S_j^q$, or $u \le S_j^q$ and $v \le S_i^l$, in which $l \ne q$, $i \ne j$ and $l, q = 1, 2, \ldots, t$, 
then Pref(u, v) = $\perp$, viz. $u$, $v$ have no preference relationship. 

• If there is no $S_i^l$ which $u$ or $v$ belongs to, where $i = 1, 2, \ldots, k$ and $l = 1, 2, \ldots, t$, 
then Pref(u, v) = $\perp$, viz. $u$, $v$ have no preference relationship. 
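The four cases can be sketched as a small Python function. The sketch assumes that each rule list holds one region per recurrence, that a region is a pair (area, excluded overlaps), that a point is assigned to the first region it falls into, and that abstention is returned as `None`; these choices, and the names `locate` and `pref`, are illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Optional, Sequence, Tuple

Box = List[Tuple[float, float]]
Region = Tuple[Box, List[Box]]  # bounding area and the overlaps cut out of it


def inside(x: Sequence[float], box: Box) -> bool:
    return all(lo <= v <= hi for v, (lo, hi) in zip(x, box))


def falls_into(x: Sequence[float], region: Region) -> bool:
    """x falls into S: inside the merging area but outside every excluded overlap."""
    area, excluded = region
    return inside(x, area) and not any(inside(x, m) for m in excluded)


def locate(x: Sequence[float], rules: List[List[Region]]) -> Optional[Tuple[int, int]]:
    """Return (part index i, recurrence l) of the first region containing x."""
    rounds = max((len(r) for r in rules), default=0)
    for l in range(rounds):
        for i, r in enumerate(rules):
            if l < len(r) and falls_into(x, r[l]):
                return i, l
    return None


def pref(u: Sequence[float], v: Sequence[float],
         rules: List[List[Region]]) -> Optional[int]:
    """1 if u is preferred to v, 0 if v is preferred to u, None for abstention."""
    loc_u, loc_v = locate(u, rules), locate(v, rules)
    if loc_u is None or loc_v is None:
        return None            # no area contains u or v
    (i, l), (j, q) = loc_u, loc_v
    if l != q or i == j:
        return None            # different recurrences, or the same area
    return 1 if i < j else 0   # lower part index means higher decision values
```

Part indices are zero-based here, so part 0 corresponds to the upper part of the sorted data.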

Consider Figure 2-1, for example: the data set $D$ consists of 30 data points. The data 
points are sorted in decreasing order according to their decision value. The sorted data 
set is represented as $D_0$. To simplify, the data in $D_0$ are divided into three parts and are 
represented in different colours. Black, grey and white represent the data in the upper 
part, middle part and lower part respectively. Obviously, black data in the upper part 
of $D_0$ have greater decision values than the grey and the white ones, and grey data in 
the middle part of $D_0$ have greater decision values than the white ones. 




See Figure 2-1: after merging all the data in the same part, three spatial areas 
$M_1^0$, $M_2^0$, $M_3^0$, denoted by bold line, fine line and broken line respectively, are obtained. 
The intersectional areas of $M_1^0$, $M_2^0$, $M_3^0$, denoted $M_{1,2}^0$, $M_{1,3}^0$, $M_{2,3}^0$, can be seen in 
Figure 2-2, given by: $M_{1,2}^0 = M_1^0 \cap M_2^0$, $M_{1,3}^0 = M_1^0 \cap M_3^0$, $M_{2,3}^0 = M_2^0 \cap M_3^0$. 

The largest irregular areas of the same rank data at this stage are obtained and 
denoted as $S_1^0$, $S_2^0$, $S_3^0$, which are given by the following formulae: 

$S_1^0 = M_1^0 \cap \overline{M_{1,2}^0} \cap \overline{M_{1,3}^0}$, $S_2^0 = M_2^0 \cap \overline{M_{1,2}^0} \cap \overline{M_{2,3}^0}$, $S_3^0 = M_3^0 \cap \overline{M_{1,3}^0} \cap \overline{M_{2,3}^0}$ 

Obviously, if a testing data item $u$ falls into $S_1^0$ and another testing data item $v$ 
falls into $S_2^0$ or $S_3^0$ then Pref(u, v) = 1; if $u$ falls into $S_2^0$ and $v$ falls into $S_1^0$ then 
Pref(u, v) = 0; if $u$ falls into $S_2^0$ and $v$ falls into $S_3^0$ then Pref(u, v) = 1; if $u$ falls into $S_3^0$ 
and $v$ falls into $S_1^0$ or $S_2^0$ then Pref(u, v) = 0. If $u$ falls into any of $M_{1,2}^0$, $M_{1,3}^0$ or $M_{2,3}^0$, 
no matter where $v$ falls, the relationship between $u$ and $v$ cannot be estimated, so 
all the data which belong to one of $M_{1,2}^0$, $M_{1,3}^0$ and $M_{2,3}^0$ are taken out from $D_0$ as a 
new training data set $D_1$, see Figure 2-2. The new training data in $D_1$ are sorted in 
decreasing order according to their decision values and are divided into three parts 
again. Notice that the colour of some data might change if they belong to a different 
part after dividing (see Figure 2-4 and Figure 2-5). This follows from the algorithm 
re-dividing the remaining data at every iteration until spatial partitioning is finished. The 
process of merging and partitioning is repeated until there are no more than $\varepsilon$ data in 
the new training set, see Figure 2-3 to Figure 2-10. 




[Figure 2-6 to Figure 2-10: later iterations of merging and partitioning] 

In the end, a series of partition areas of each part are obtained; for example, the areas 
obtained for one part in each recurrence $q = 0, 1, \ldots, t$ are shown with a grey background. 

4 Experiment and evaluation 

The ultimate goal of data spatial partitioning is to improve prediction accuracy. So we 
designed an experiment to evaluate the proposed data ranking method to see how well 
it performs in prediction with a real world database. A ranking system called 
Rank&Predict using the proposed method above has been developed. 

Stock closing data for the United Kingdom over 45 trading days, beginning on 2 
Aug. 1999 and ending on 1 Oct. 1999, were collected from the Reuters stock market 
repository and stored in a file named Dataftse. There were 96 stocks with 17 
numerical attributes in each trading day. The parameter k is set to 3 for spatial 
partitioning. 

Each window of ten consecutive trading days of stock data in the month from 2 Aug. 1999 
to 2 Sept. 1999 was chosen for training and testing. The prediction efficiency of the algorithm 
was evaluated using 5-fold cross validation, and the testing accuracy used in the experiment is 
defined as: 

$TA = \dfrac{\text{number of correct rankings}}{\text{total number of rankings}} \times 100$. 

The results are shown in Table 1, in which the acronyms are: TA-Testing 
Accuracy, VP-the Value of Parameter k, and PSD-the Period of Sampling Data. 



PSD (1999)          TA:VP(3) (%)     PSD (1999)          TA:VP(3) (%) 
2 Aug.-13 Aug.      72.03            3 Aug.-16 Aug.      67.92 
4 Aug.-17 Aug.      65.98            5 Aug.-18 Aug.      65.42 
6 Aug.-19 Aug.      76.51            9 Aug.-20 Aug.      70.45 
10 Aug.-23 Aug.     75.22            11 Aug.-24 Aug.     72.47 
12 Aug.-25 Aug.     59.16            13 Aug.-26 Aug.     81.88 
16 Aug.-27 Aug.     82.87            17 Aug.-30 Aug.     72.83 
18 Aug.-31 Aug.     66.02            19 Aug.-1 Sept.     70.18 
20 Aug.-2 Sept.     68.28            Average             71.15 



Table 1: The testing accuracy obtained by Rank&Predict. 



Stock market data are very noisy, filled with random and unpredictable 
factors, but the results from initial experiments are quite remarkable: the average 
testing accuracy is 71.15%. It might be possible to make money from Rank&Predict. 

5 Discussion and Future work 

This paper presents a novel approach to data ranking based on spatial partitioning. The 
spatial partition areas can be regarded as a model of the raw data. The important 
feature of this ranking method is to find the largest irregular area to represent the same 
rank data. It executes union, intersection, and complement operations in each 
dimension using the projection of spatial areas in multidimensional space. It also 
represents the same rank data using the biggest irregular spatial area to realise the goal 



84 G. Guo, H. Wang, and D. Bell 



of data ranking. A series of spatial partition areas are obtained after learning from a 
training data set using the spatial partitioning method, and these can be used easily in 
ranking subsequently. 

We have shown that the proposed automatic data ranking method can be regarded 
as a novel approach to data mining to discover potential rank patterns in databases. A 
ranking system called Rank&Predict has been developed. Results from initial 
experiments on stock market data using Rank&Predict system are quite remarkable. 
Further research is required into how to eliminate noise to improve testing accuracy. 

Abstract. It is desirable to automatically learn the effects of actions in an 
unknown environment. C4.5 has been used to discover associations, and it can 
also be used to find causal rules. Its output consists of rules that predict the value 
of a decision attribute using some condition attributes. Integrating C4.5's results 
in other applications usually requires spending some effort in translating them 
into a suitable format. Since C4.5's rules are horn clauses and have the same 
expressive power as Prolog statements, we have modified standard C4.5 so it 
will optionally generate its rules in Prolog. We have made sure no information is 
lost in the conversion process. It is also possible for the prolog statements to 
optionally retain the certainty values that C4.5 computes for its rules. This is 
achieved by simulating the certainty values as the probability that the statement 
will fail for no apparent reason. Prolog can move from statement to statement 
and find a series of rules that have to be fired to get from a set of premises to a 
desired result. We briefly mention how, when dealing with temporal data, the 
Prolog statements can be used for recursive searches, thus making C4.5's output 
more useful. 



1 Introduction 

C4.5 [4] allows us to extract classification rules from observations of a system. The 
input to C4.5 consists of a set of records. Each record contains some condition 
attributes and a single decision attribute. A domain expert should decide which 
variable depends on the others, and so is to be considered the decision attribute. Though 
C4.5 has been traditionally used as a classifier, it can even be used to find temporal 
relations [3]. 

C4.5 uses a greedy algorithm with one look-ahead step. It computes the information 
contents of each condition attribute and the results are used to prune the condition 
attributes and create classification rules that are simpler than the original input records. 
The output of C4.5 consists of simple rules like if {(a = 1) AND (b = 2)} then {(c = 4)}. 
There is an error value assigned to each rule, which determines the confidence in that 
rule. C4.5 creates a decision tree first and then derives decision rules from that tree. 
After this a program in the C4.5 package called "consult" can be used to actually 
execute the rules. It prompts for the condition attributes and then outputs the 
appropriate value of the decision attribute. Any other use of the generated rules, 
including the integration of C4.5's results in other systems, may require some effort in 
changing the format of the results in order to make them compatible with the input of 
the other systems. 

C4.5's output can be expressed by horn clauses, so we have modified C4.5 to 
directly output Prolog statements in addition to its normal output. This enables the 
results to be readily used in any Prolog-based system, thus making them readily 
machine-processable. Much of the ground work for this has been presented in detail in 
[2], where we explain C4.5's usability in deriving association and causal rules in a 
simple environment. In that paper we also showed the usefulness of having Prolog 
statements in applications beyond those ordinarily expected from C4.5's output, which 
are usually meant to be used by people instead of automatic machine processing. In 
this paper we discuss the improvements and modifications that were needed to make 
the Prolog statements better represent C4.5's results. This includes preserving more 
information in the output and also handling the certainty values assigned by C4.5 to its 
output rules. This is important because Prolog by default considers all its statements to 
be always reliable. In Section 2 we explain how C4.5's rules are converted into Prolog 
statements. Section 3 concludes the paper and gives information as to where the reader 
can find the patch file needed to modify C4.5. 



2 C4.5 Speaks Prolog 

The rules created by C4.5 can easily be represented as Prolog statements. The 
"c4.5rules" program in the C4.5 package generates rules from a decision tree. We have 
modified this program to optionally output its resulting rules in Prolog [2]. This 
removes the need for a separate pass for the translation of the rules. When the user 
gives the command line option '-p 0', a <file stem>.pl file will be created in addition 
to the normal output. The generated Prolog statements are in the Edinburgh dialect and 
can be fed to most Prolog interpreters with no change. 

We used the modified c4.5rules program to create Prolog statements from data 
generated by artificial robots that move around in a two dimensional artificial world 
called URAL [6]. URAL is a typical Artificial Life simulator. The aim of the robot in 
this artificial world is to move around and find food. At each time-step the robot 
randomly chooses to move from its current position to either Up, Down, Left, or Right. 
The robot does not know the meaning or the results of any of these actions, so for 
example it does not know that its position may change if it chooses to go to Left. It can 
not get out of the board, or go through the obstacles that are placed on the board by the 
simulator. In such cases, a move action will not change the robot's position. The robot 
can sense which action it takes in each situation. The goal is to learn the effects of its 
actions after performing a series of random moves. To do this we save the robot's 
current situation after each action has taken place. 

The simulator records the current x and y locations, the move direction, and the next 
value of x or y. When these data are fed to C4.5, it correctly determines that the next x 
or y values depend only on the previous location and the movement direction [2]. One 
example is a rule like: if {(X1 = 1) AND (A1 = L)} then (X2 = 0), which states that the 
result of a move to the left is a decrease in the value of x. A1 and X1 are the previous 
action and x value respectively. X2 represents the next x value. Table 1 shows a few of 
the Prolog statements generated by the modified c4.5rules as described in [2]. The first 
statement corresponds to the example C4.5 rule just presented. A1 and X1 are as 
explained in that rule, and the classification is done on the next value of x. 



class(A1, X1, 0) :- A1 = 2, X1 = 1. 
class(A1, X1, 2) :- A1 = 3, X1 = 1. 
class(A1, X1, 3) :- A1 = 3, X1 = 2. 



Table 1. Three Sample Prolog statements generated by C4.5 

In Table 1 a value of 2 and 3 for action A1 could mean going to the left and right, 
respectively. Following C4.5's terminology, the results are designated by a predicate 
called "class." If we had to merge the Prolog statements generated for different 
decision attributes (because we might need to work with more than one decision 
attribute) then we could manually rename "class" to something like "classx" and 
remove any name clashes. In the above case this could happen if we want to represent 
the rules for the moves along the y axis too. In the left-hand side of the Prolog 
statements the condition attributes that are involved in the decision making process 
come first, and the value of the decision attribute comes last. In the body of the rules, 
the condition attributes' values are used for the decision making process. All other 
condition attributes are ignored. 

Notice that the automatically generated Prolog statements use Prolog's unification 
operator (=) instead of the comparison operator (=:=). This allows the user to traverse 
the rules backward and go from the decision attribute to the condition attributes, or 
from a set of decision and condition attributes, to the remaining condition attributes. 
Some example queries are class(A1, 1, 2) (which actions take the creature from x = 1 
to x = 2?) or class(A1, X1, 3) (which action/location pairs immediately lead to x = 3?). 
The ability to interpret the rules in different directions makes C4.5’s discovered rules 
more useful when represented as Prolog statements. 

C4.5 can generate rules that rely on threshold testing and set membership testing. If 
we use the standard Prolog operators of =< and > for threshold testing, and implement 
a simple member() function for testing set membership, then we would not be able to 
traverse the rules backward, as they lack the ability to unify variables. So if we had a 
clause like: class(A, B, C) :- A =< B, member(A, C), then we would be unable to use a 
query like class(A, 3, [1, 2, 3]), because A is not unified, and Prolog can not perform 
the test =< on variables that are not unified. Adding the unification ability to =<, > and 
member() will remove this limitation. For example, a unifying X > 10 would choose a 
value above 10 for X if it is not already unified, and a unifying member(X, [1, 2, 3]) 
would unify X with one of 1, 2, or 3 if it is not already unified. Both cases would 
always succeed if X is not unified, because they make sure that the variable is unified 
with an appropriate value, but could fail if X is already unified because in that case a 
test will be performed. 

In [2] we have provided simple code to do the unification. There we show that this 
unification ability allows the Prolog statements to be traversed backwards, and so they 
can be used for automatic planning purposes. This was achieved by first noticing that 
in Table 1 we are dealing with temporal data, since the decision attribute (the next x 
position) is actually the same as one of the condition attributes seen at a later time. 
This can be used to come up with a sequence of moves to change the x or y position 
one by one and get to a desired destination. Suppose we are at x = 1 and the aim is to 
go from there to x = 3. From Table 1 Prolog knows that to go to x = 3 one has to be at x 
= 2 and perform a Right move, but to be at x = 2, one has to be at x = 1 and again go to 
the right. This completes the planning because we are already at x = 1. In other words, 
X1 = 1 in a Prolog statement actually means class(_, _, 1), so the third statement in 
Table 1 can be converted to this: class(A1, X1, 3) :- A1 = 3, X1 = 2, class(_, _, 2). 
Intuitively this means that to go to x = 3 with a move to the right one has to be at x = 2 
(class(_, _, 2)), and we do not care how we got there. Now Prolog will have to satisfy 
class(_, _, 2) by going to x = 2. The changes that have to be made in the statements to 
allow Prolog to do such recursive searches are described in detail in [2]. 
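The backward search described above can be imitated outside Prolog. The following Python sketch encodes the three statements of Table 1 as (action, previous x, next x) triples and searches backward from the goal position; the function `plan` and the depth limit are illustrative assumptions, not part of the modified c4.5rules program or of the code in [2].

```python
from typing import List, Optional, Tuple

Rule = Tuple[int, int, int]  # (action A1, previous x X1, next x)

# The three statements of Table 1: class(A1, X1, Next) :- A1 = a, X1 = x.
RULES: List[Rule] = [(2, 1, 0), (3, 1, 2), (3, 2, 3)]


def plan(start: int, goal: int, rules: List[Rule], depth: int = 10) -> Optional[List[Rule]]:
    """Find rules leading from start to goal by searching backward from the goal."""
    if start == goal:
        return []               # already there, nothing to do
    if depth == 0:
        return None             # give up on overly long chains
    for action, before, after in rules:
        if after == goal:
            # To reach `goal` with this rule we must first reach `before`.
            prefix = plan(start, before, rules, depth - 1)
            if prefix is not None:
                return prefix + [(action, before, after)]
    return None


print(plan(1, 3, RULES))  # [(3, 1, 2), (3, 2, 3)]: two moves with action 3
```

With action 3 read as a move to the right, the result corresponds to the two Right moves of the worked example.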

There is a problem in the way rules were represented in [2]. We use example Prolog 
statements from the more complex Letter Recognition Database from University of 
California at Irvine's Machine Learning Repository [1] to clarify our point. This 
database consists of 20,000 records that use 16 condition attributes to classify the 26 
letters of the English alphabet. The implementation in [2] would give us the kind of 
rules seen in Table 2 below. 



class(A10, A13, A14, A15, 8) :- A10 = 7, A13 = 0, A14 = 9, A15 = 4. 
class(A10, A11, A13, 8) :- A10 = 13, A11 = 3, A13 = 3. 
class(A10, A12, A13, A14, A15, 8) :- A10 = 8, A12 = 5, A13 = 3, A14 = 8, A15 = 5. 
class(A7, A13, A14, A16, 8) :- A7 = 9, A13 = 0, A14 = 9, A16 = 7. 
class(A6, A10, A11, A13, 8) :- A6 = 9, A10 = 7, A11 = 6, A13 = 0. 



Table 2. Some of the Prolog statements from the Letter Recognition Database. 

The 16 attributes are named A1 to A16. The decision attribute encodes the index of 
the letters. Table 2 shows some of the rules created for the letter "I." The problem with 
this form of rule output is that Prolog's variable names are limited in scope to the 
statement in which they appear. In the last two statements for example, as far as Prolog 
is concerned A6 and A7 represent the same thing: Both are place-holders for the first 
argument of the predicate "class" that has a total of five arguments. This means that the 
user can not use the name of a variable as he knows it to get its value. A representation 
like this would allow the user to perform queries such as class(9, 7, A14, 0, 8), and get 
the answer A14 = 6, which is probably not what he had in mind. This happens because 
Prolog is using the last statement to derive this result, and that statement actually 
concerns A11 and not A14. 

To prevent this loss of information and the subsequent confusion, the modified 
c4.5rules program was changed to preserve all the condition attributes in the left-hand 
side of the Prolog statements. This allows Prolog to distinguish among the condition 
attributes by using their position, so now there is a way for the user to specify the 
variables unambiguously. The resulting statements are shown in Table 3. 

class(A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, 8) :- 
    A10 = 7, A13 = 0, A14 = 9, A15 = 4. 
class(A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, 8) :- 
    A10 = 13, A11 = 3, A13 = 3. 
class(A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, 8) :- 
    A10 = 8, A12 = 5, A13 = 3, A14 = 8, A15 = 5. 
class(A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, 8) :- 
    A7 = 9, A13 = 0, A14 = 9, A16 = 7. 
class(A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, 8) :- 
    A6 = 9, A10 = 7, A11 = 6, A13 = 0. 



Table 3. Prolog statements with all the condition attributes present. 
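To illustrate the layout of Table 3, the following Python sketch renders a rule as a Prolog statement whose head lists every condition attribute. It is a hypothetical helper written for this illustration; the actual modification is a patch to the C sources of c4.5rules, and the function name `to_prolog` and its parameters are assumptions.

```python
from typing import Dict


def to_prolog(conditions: Dict[int, int], decision: int, n_attrs: int) -> str:
    """Render one rule with all n_attrs condition attributes in the head.

    conditions maps an attribute index (1-based) to the value it is tested against.
    """
    head_args = ", ".join(f"A{i}" for i in range(1, n_attrs + 1))
    tests = ", ".join(f"A{i} = {v}" for i, v in sorted(conditions.items()))
    return f"class({head_args}, {decision}) :- {tests}."


# Fourth statement of Table 3: A7 = 9, A13 = 0, A14 = 9, A16 = 7 gives class 8
print(to_prolog({7: 9, 13: 0, 14: 9, 16: 7}, decision=8, n_attrs=16))
```

Because every attribute keeps a fixed argument position in the head, a query can address A14 by position regardless of which attributes a particular rule tests.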

The statements are longer, but now the users can make sure Prolog understands 
what a query means. They can issue a query such as class(_, _, _, _, _, _, 9, _, _, _, _, 
_, 0, A14, _, 7, 8) and get the correct result of A14 = 9. 

There is still something missing from the Prolog statements. C4.5 assigns a certainty 
value to each rule it generates, which shows how reliable that rule is. Standard Prolog 
does not support the notion of reliability of a statement. To convey this information to 
the Prolog statements in a way that would be understandable to most Prolog systems, 
we used a random number generator to fail the rules proportional to their certainty 
value. A random integer is generated and tested against the certainty value of the rule. 
A statement can fail if this test fails, no matter what the value of the condition 
attributes. C4.5 computes the certainty values as a number less than 1 and outputs them 
with a precision of 0.001. The modified c4.5rules program multiplies this number by 
1000 to avoid having to deal with real numbers. The modified c4.5rules program 
outputs the necessary code to handle the certainty value if the user invokes it with a '-p 
1' command line argument. We used a different command line argument for this 
because the user may not always need to have the certainty values in the output. 
Statements resulting from this argument are given in Table 4. 



class(A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, 8) :- 
    random(1000, N_), N_ < 917, A10 = 7, A13 = 0, A14 = 9, A15 = 4. 
class(A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, 8) :- 
    random(1000, N_), N_ < 870, A10 = 13, A11 = 3, A13 = 3. 
class(A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, 8) :- 
    random(1000, N_), N_ < 793, A10 = 8, A12 = 5, A13 = 3, A14 = 8, A15 = 5. 
class(A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, 8) :- 
    random(1000, N_), N_ < 793, A7 = 9, A13 = 0, A14 = 9, A16 = 7. 
class(A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, 8) :- 
    random(1000, N_), N_ < 707, A6 = 9, A10 = 7, A11 = 6, A13 = 0. 



Table 4. Prolog statements with the certainty values. 

We use the first rule in Table 4 to explain the statements. The first rule has a 
certainty value of 91.7%. The random(1000, N_) function assigns a number between 
0 and 999 to N_. This value is then compared to the certainty value of the rule, which 
is 917. The statement could fail based on the results of the comparison. The random 
number is named N_ to lessen the chances of an accidental clash with the name of a 
condition attribute. One could implement the random number generator like this [5]: 

    :- dynamic seed/1.   % lets retract/asserta update seed/1 in most modern Prologs 
    seed(13). 

    % random(R, N): N is the current seed modulo R; the seed is then advanced. 
    random(R, N) :- seed(S), N is (S mod R), retract(seed(S)), 
        NewSeed is (125 * S + 1) mod 4096, asserta(seed(NewSeed)), !. 
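To see how often a statement guarded by this generator succeeds, the same recurrence can be simulated in Python. The class `Lcg` and the function `pass_rate` are hypothetical names used only for this sketch; the constants 13, 125, 1 and 4096 are those of the Prolog code above.

```python
class Lcg:
    """Mirror of the Prolog generator: seed(13), NewSeed is (125 * S + 1) mod 4096."""

    def __init__(self, seed: int = 13) -> None:
        self.seed = seed

    def random(self, r: int) -> int:
        n = self.seed % r                           # N is (S mod R)
        self.seed = (125 * self.seed + 1) % 4096    # advance the seed
        return n


def pass_rate(threshold: int, draws: int = 4096) -> float:
    """Fraction of draws for which the test N_ < threshold succeeds."""
    gen = Lcg()
    return sum(gen.random(1000) < threshold for _ in range(draws)) / draws


print(pass_rate(917))
```

This recurrence visits every seed from 0 to 4095 once per cycle, so over a full cycle the test `N_ < 917` succeeds 3764 times out of 4096, about 91.9% of the time. That is close to, but not exactly, the 91.7% certainty value, because 4096 is not a multiple of 1000.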

We have taken an active approach to representing the certainty values, because they
can actually cause the statements to fail. In an alternate implementation, we could
choose to simply output these values as part of the statements and leave any specific
usage to the user. An example, taken from the last statement in Table 4, would be

class(A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16,
    N_, 8) :- A6 = 9, A10 = 7, A11 = 6, A13 = 0, N_ = 707.
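
As an illustration only, the effect of an active certainty value can be mimicked outside
Prolog. The following Python sketch assumes the linear congruential generator quoted from
[5] (seed 13, multiplier 125, modulus 4096) and the first rule of Table 4. The names
SeedRandom and rule_fires are hypothetical and are not part of the modified c4.5rules
program.

```python
# Illustrative sketch only; SeedRandom and rule_fires are hypothetical names.

class SeedRandom:
    """Mirrors the seed/1 and random/2 clauses quoted from [5]."""

    def __init__(self, seed=13):
        self.seed = seed

    def random(self, r):
        n = self.seed % r                         # N is S mod R
        self.seed = (125 * self.seed + 1) % 4096  # NewSeed is (125 * S + 1) mod 4096
        return n


def rule_fires(attrs, conditions, threshold, rng):
    """Draw a number first, as the generated rules do, then test the conditions."""
    if rng.random(1000) >= threshold:
        return False  # simulated failure caused by the certainty value
    return all(attrs.get(name) == value for name, value in conditions.items())


# First rule of Table 4: certainty 91.7%, conditions on A10, A13, A14 and A15.
first_rule = {"A10": 7, "A13": 0, "A14": 9, "A15": 4}
rng = SeedRandom(13)
outcomes = [rule_fires(first_rule, first_rule, 917, rng) for _ in range(4)]
```

With the seed 13 the successive draws are 13, 626, 547 and 984, so in this sketch the rule
succeeds three times and then fails on the fourth call even though its conditions still
hold.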



3 Concluding Remarks 

Outputting C4.5's classification rules as Prolog statements makes them more
useful. They can be traversed in any direction, and can be readily integrated into
Prolog-based systems. Prolog's search abilities can be employed to search over the
rules and to go from rule to rule, as in a traditional theorem prover. This
becomes more useful with the kind of temporal data explained in the paper. The
certainty values can be omitted from the output Prolog statements if they are to be used in
deterministic environments. The user can also opt to simulate the possible failure of the
rules by having the certainty values represented in the rules as thresholds that are
tested against randomly generated numbers.

The modified c4.5rules program retains backward compatibility, and its output is 
unchanged when the new options are not used. The modifications are available in the 
form of a patch file that can be applied to standard C4.5 Release 8 source files. It is 
freely available from http://www.cs.uregina.ca/~karimi or by contacting the authors.
C4.5 Release 8's sources are available for download from Ross Quinlan's webpage at
http://www.cse.unsw.edu.au/~quinlan/



References 

1. Blake, C.L. and Merz, C.J., UCI Repository of machine learning databases
[http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California,
Department of Information and Computer Science, 1998.

2. Karimi, K. and Hamilton, H. J., "Learning With C4.5 in a Situation Calculus Domain," The
Twentieth SGES International Conference on Knowledge Based Systems and Applied
Artificial Intelligence (ES2000), Cambridge, UK, December 2000.

3. Karimi, K. and Hamilton, H. J., "Finding Temporal Relations: Causal Bayesian Networks vs.
C4.5," The 12th International Symposium on Methodologies for Intelligent Systems
(ISMIS'2000), Charlotte, NC, USA.

4. Quinlan, J. R., C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

5. Clocksin, W.F. and Mellish, C.S., Programming in Prolog, Springer Verlag, 1984.

6. http://www.cs.uregina.ca/~karimi/URAL.java




Visualisation of Temporal Interval Association Rules 



Chris P. Rainsford¹ and John F. Roddick²

¹ Defence Science and Technology Organisation, DSTO C3 Research Centre
Fernhill Park, Canberra, 2600, Australia.

Chris.rainsford@dsto.defence.gov.au
² School of Informatics and Engineering, Flinders University of South Australia
GPO Box 2100, Adelaide 5001, Australia.
roddick@cs.flinders.edu.au



Abstract. Temporal intervals and the interaction of interval-based events are 
fundamental in many domains including medicine, commerce, computer 
security and various types of normalcy analysis. In order to learn from 
temporal interval data we have developed a temporal interval association rule 
algorithm. In this paper, we will provide a definition for temporal interval 
association rules and present our visualisation techniques for viewing them. 
Visualisation techniques are particularly important because the complexity and 
volume of knowledge that is discovered during data mining often makes it 
difficult to comprehend. We adopt a circular graph for visualising a set of 
associations that allows underlying patterns in the associations to be identified. 
To visualise temporal relationships, we have developed a parallel coordinate graph.



1 Introduction 

In recent years data mining has emerged as a field of investigation concerned with 
automating the process of finding patterns within large volumes of data [9]. The 
results of data mining are often complex in their own right and visualisation has been 
widely employed as a technique for assisting users in seeing the underlying semantics 
[12]. In addition, mining from temporal data has received increased attention recently 
as it provides insight into the nature of changes in data [11]. 

Temporal intervals are inherent in nature and in many business domains that are 
modelled within information systems. In order to capture these semantics, we have 
developed an extension to the definition of association rules [1] to accommodate 
temporal interval data [10]. Association rules have been widely used as a data mining 
tool for market analysis, inference in medical data and product promotion. By 
extending these rules to accommodate temporal intervals, we allow users to find 
patterns that describe the interaction between events and intervals over time. For 
example, a financial services company may be interested to see the way in which 
certain products and portfolios are interrelated. Customers may initially purchase an 
insurance policy and then open an investment portfolio or superannuation fund with 
the same company. It may then be interesting to see which the customer terminates 
first. Likewise, a customer history may show that they have held investments in three 
different investment funds. It may then be interesting to see if all three were held
simultaneously, one following the other, or in some overlapping fashion. Looking for 
underlying trends and patterns in this type of behaviour is likely to be highly useful 
for analysts who are seeking to market these products, both to new investors and long 
term clients. In order to increase the comprehensibility of rules that describe such 
relationships, we have also developed two visualisation tools. The first tool uses a 
circular graph to display the underlying association rules. This allows the user to see 
patterns within the underlying associations. The second visualisation uses a parallel 
coordinate approach to present the temporal relationships that exist within the data in
an easily comprehensible format. Importantly, both of these visualisation techniques
are capable of displaying large numbers of rules and can be represented in a
fixed two-dimensional format that is easily reproduced on paper or other media.

In the next section we will provide a definition for temporal interval association 
rules. Section 3 discusses our association rule visualisation tool. Section 4 then 
describes our temporal relationship visualiser. A conclusion is provided in Section 5. 



2 Temporal Interval Association Rules 

We define a temporal interval association rule to be a conventional association rule
that includes a conjunction of one or more temporal relationships between items in the
antecedent or consequent. Building upon the original formalism in [1], temporal
interval association rules can be defined as follows. Let $I = \{I_1, I_2, \ldots, I_m\}$ be a
set of binary attributes or items and $T$ be a database of tuples. Association rules were first
proposed for use within transaction databases, where each transaction $t$ is recorded
with a corresponding tuple. Hence attributes represented items and were limited to a
binary domain where $t(k) = 1$ indicated that the item $I_k$ was positive in that case (for
example, had been purchased as part of the transaction, observed in that individual,
etc.), and $t(k) = 0$ indicated that it had not. Temporal attributes are defined as
attributes with associated temporal points or intervals that record the time for which
the item or attribute was valid in the modeled domain. Let $X$ be a set of some
attributes in $I$. It can be said that a transaction $t$ satisfies $X$ if, for all attributes
$I_k$ in $X$, $t(k) = 1$. Consider a conjunction of binary temporal predicates
$P_1 \wedge P_2 \wedge \ldots \wedge P_n$ defined on attributes contained in either $X$ or $Y$,
where $n > 0$. Then by a temporal association rule, we mean an implication of the form
$X \Rightarrow Y \wedge P_1 \wedge P_2 \wedge \ldots \wedge P_n$, where $X$, the
antecedent, is a set of attributes in $I$ and $Y$, the consequent, is a set of attributes in $I$
that are not present in $X$. The rule $X \Rightarrow Y \wedge P_1 \wedge P_2 \wedge \ldots \wedge P_n$
is satisfied in the set of transactions $T$ with the confidence factor $0 \leq c \leq 1$ iff at
least c% of transactions in $T$ that satisfy $X$ also satisfy $Y$. Likewise each predicate $P_i$
is satisfied with a temporal confidence factor of $0 \leq tc_{P_i} \leq 1$ iff at least tc% of
transactions in $T$ that satisfy $X$ and $Y$ also satisfy $P_i$. The notation
$X \Rightarrow Y \mid c \wedge P_1 \mid tc \wedge P_2 \mid tc \wedge \ldots \wedge P_n \mid tc$
is adopted to specify that the rule $X \Rightarrow Y \wedge P_1 \wedge P_2 \wedge \ldots \wedge P_n$
has a confidence factor of $c$ and temporal confidence factor of $tc$. As an illustration
consider the following simple example rule:

$$\mathit{policyZ} \Rightarrow \mathit{investX}, \mathit{productY} \mid 0.79 \;\wedge\; \mathit{during}(\mathit{investX}, \mathit{policyZ}) \mid 0.75 \;\wedge\; \mathit{before}(\mathit{productY}, \mathit{investX}) \mid 0.81$$

This rule can be read as follows:

The purchase of investment X and product Y are associated with insurance
policy Z with a confidence factor of 0.79. The investment in X occurs during the
period of policy Z with a temporal confidence factor of 0.75 and the purchase of
product Y occurs before investment X with a temporal confidence factor of 0.81.

Binary temporal predicates are defined using Allen’s thirteen interval-based 
relationships between two intervals. A thorough description of these relationships can 
be found in [2]. We also use the neighborhood relationships defined by Freksa that 
allow generalisation of relationships [5]. A detailed description of our learning 
algorithm is beyond the scope of this paper and readers are referred to [10]. 
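
To make the two confidence measures concrete, the following Python sketch evaluates the
example rule on a small invented data set. It is not the learning algorithm of [10]. The
customer records, the helper names and the representation of each item as a (start, end)
interval are assumptions made for illustration, and only two of Allen's relationships are
shown.

```python
# Illustrative sketch: each transaction maps an item name to its (start, end) interval.
# The data and all helper names are hypothetical.

def during(a, b):
    """Allen's 'during': interval a lies strictly inside interval b."""
    return b[0] < a[0] and a[1] < b[1]


def before(a, b):
    """Allen's 'before': interval a ends before interval b starts."""
    return a[1] < b[0]


def satisfies(transaction, items):
    return all(item in transaction for item in items)


def confidence(transactions, x, y):
    """Fraction of transactions satisfying X that also satisfy Y."""
    with_x = [t for t in transactions if satisfies(t, x)]
    if not with_x:
        return 0.0
    return sum(satisfies(t, y) for t in with_x) / len(with_x)


def temporal_confidence(transactions, x, y, predicate, left, right):
    """Fraction of transactions satisfying X and Y that also satisfy the predicate."""
    with_xy = [t for t in transactions if satisfies(t, x) and satisfies(t, y)]
    if not with_xy:
        return 0.0
    return sum(predicate(t[left], t[right]) for t in with_xy) / len(with_xy)


customers = [
    {"policyZ": (0, 10), "investX": (2, 6), "productY": (0, 1)},
    {"policyZ": (0, 10), "investX": (3, 12), "productY": (1, 2)},
    {"policyZ": (5, 9), "investX": (6, 8), "productY": (7, 9)},
    {"policyZ": (1, 4)},
]
x, y = ["policyZ"], ["investX", "productY"]
c = confidence(customers, x, y)
tc_during = temporal_confidence(customers, x, y, during, "investX", "policyZ")
tc_before = temporal_confidence(customers, x, y, before, "productY", "investX")
```

For this toy data the rule confidence is 0.75 and both temporal confidence factors are 2/3,
since each predicate holds in two of the three transactions that satisfy both X and Y.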



3 Visualising Associations 

Finding patterns within the temporal interval associations may be assisted with the 
use of visualisation techniques. This is particularly important where the number of 
association rules is found to be large and the discovery of underlying patterns by 
inspection is not possible. For this purpose we have devised two separate 
visualisation techniques. The first can be used to visualise any association rule and
the second is specific to temporal associations. 

The visualisation of sets of association rules has been addressed in a number of 
different ways. One approach has been to draw connected graphs [6]. However, if 
the number of rules is large this approach involves a complex layout process that 
needs to be optimised in order to avoid cluttering the graph. An elegant three- 
dimensional model is provided in the MineSet™ software tool [4]. We have chosen to 
develop a visualisation that can handle a large volume of associations and that can be 
easily reproduced in two dimensions, e.g. as a paper document, or an overhead
projection slide. In addition, it provides an at-a-glance view of the data that does not 
need to be navigated and explored to be fully understood. This approach 
complements the approaches of others and is more applicable in some circumstances. 

We have adopted a circular graph layout where items involved in rules are mapped 
around the circumference of a circle, see Figure 1. Associations are then plotted as 
lines connecting these points, where a gradient in the colour of the line, from
blue (dark) to yellow (light), indicates the direction of the association from the
antecedent to the consequent. A green line highlights associations that are bi- 
directional and this allows bi-directional relationships to be immediately identified. 
Circular graph layouts have been successfully used in several other data mining 
applications, including Netmap [3], [8]. A key characteristic of this type of 
visualization is its ability to display large volumes of information. The circle graph 
gives an intuitive feel for patterns within the underlying data. For example, items that 
have several other items associated with them will have a number of blue lines 
leaving their node on the circle. These items may be selected for marketing to attract 
new clients, because it is likely that the clients will also purchase other items or 
services as part of an overall basket. Note however that no temporal information is 
provided in this graph. In cases where the number of items is large, concept
ascension may be employed to reduce complexity. 
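
The geometry of such a layout is simple to compute. The following Python sketch is an
illustration under our own assumptions rather than the code of the tool itself: it places
the items at equal angles around a circle and separates one-way associations from
bi-directional ones, which the display described above would draw with a blue-to-yellow
gradient and in green respectively. The function names are hypothetical.

```python
import math

# Illustrative sketch; circular_layout and classify_edges are hypothetical names.

def circular_layout(items, radius=1.0):
    """Place items at equal angles around a circle; returns {item: (x, y)}."""
    n = len(items)
    return {
        item: (radius * math.cos(2 * math.pi * k / n),
               radius * math.sin(2 * math.pi * k / n))
        for k, item in enumerate(items)
    }


def classify_edges(rules):
    """Split (antecedent, consequent) pairs into one-way and bi-directional edges."""
    rule_set = set(rules)
    one_way = []      # to be drawn with a gradient from antecedent to consequent
    two_way = set()   # to be drawn in a single highlight colour
    for antecedent, consequent in rules:
        if (consequent, antecedent) in rule_set:
            two_way.add(frozenset((antecedent, consequent)))
        else:
            one_way.append((antecedent, consequent))
    return one_way, two_way


positions = circular_layout(["A", "B", "C", "D"])
one_way, two_way = classify_edges([("A", "B"), ("B", "A"), ("A", "C")])
```

Each rule then becomes a straight line between the positions of its two items, so the number
of lines leaving a node gives the at-a-glance impression of how many items are associated
with it.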




Fig. 1. A screenshot of our association rule visualisation window.



4 Visualising Temporal Associations 

Our first visualisation does not display the details of discovered temporal 
relationships between items in the association rules. In order to display this 
information it has been necessary to develop a new visualization tool. We have 
developed a simple visualisation technique based upon parallel coordinate 
visualisation. Parallel coordinate visualization has been used successfully in other 
data mining tools to display large volumes of data [7]. A screenshot of this 
visualisation is depicted in Figure 2. We start by plotting all of the items on the right- 
hand side of temporal predicates along a vertical axis. The items on the left-hand side 
of the temporal predicate are plotted along an axis on the opposite side of the screen 
with the labels for the thirty temporal relationships we have adopted lined along a 
central vertical axis. The temporal relationships can be seen as semi-ordered based 
upon the location of two intervals with respect to each other along the time axis.
Using simple heuristics we have imposed an artificial ordering upon the relationships 
in order to allow them to be represented meaningfully along a single line. We then 
draw lines between items that have a temporal relationship and the lines intersect the 
central axis at the point that corresponds to the nature of that relationship. The lines 
are coloured to reflect the temporal confidence associated with the underlying 
relationship. 
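
The construction described above can be sketched as follows. This Python fragment is a
minimal illustration and not the implementation of our tool. The ordering of relationships
passed in as relation_order is a placeholder, the function names are hypothetical, and
coordinates are normalised so that the three vertical axes sit at x = 0, 0.5 and 1.

```python
# Illustrative sketch; axis_positions and temporal_polylines are hypothetical names.

def axis_positions(labels):
    """Spread labels evenly over [0, 1] along a vertical axis, preserving order."""
    n = len(labels)
    if n == 1:
        return {labels[0]: 0.5}
    return {label: k / (n - 1) for k, label in enumerate(labels)}


def temporal_polylines(predicates, relation_order):
    """predicates: list of (relation, left_item, right_item, temporal_confidence).

    Returns one (points, temporal_confidence) pair per predicate, where points
    runs from the left-hand item through the relationship on the central axis
    to the right-hand item.
    """
    left = axis_positions(sorted({p[1] for p in predicates}))
    right = axis_positions(sorted({p[2] for p in predicates}))
    centre = axis_positions(relation_order)
    lines = []
    for relation, left_item, right_item, tc in predicates:
        points = [(0.0, left[left_item]),
                  (0.5, centre[relation]),
                  (1.0, right[right_item])]
        lines.append((points, tc))  # tc would be mapped to a line colour
    return lines


order = ["before", "during", "after"]  # placeholder ordering, for illustration only
lines = temporal_polylines(
    [("during", "investX", "policyZ", 0.75), ("before", "productY", "investX", 0.81)],
    order,
)
```

Because every line crosses the central axis at the height assigned to its relationship, lines
that share a relationship bundle together at that point, which is what makes dominant
relationships visible.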




Fig. 2. A screenshot of our temporal interval visualisation window. 

Based upon this visualisation it is possible to quickly determine patterns within the
data. For example, a financial services company may seek to identify marketing
opportunities for items to its current clients. By looking for items on the right-hand
side of the graph that are connected via lines running predominantly through the top
half of the temporal relationship line (corresponding to items purchased after the item
on the left-hand side), the market analyst can identify services to market to
holders of the connected items on the left-hand side of the graph. The strongest such
correlations can be identified quickly from the colour of the lines, which indicates the
confidence of each relationship.


5 Summary

In this paper we have detailed two visualisation techniques to support the analysis of 
temporal interval association rules. These techniques are designed to allow a rapid 
understanding of patterns existing within large sets of rules. The first technique is a 
circular association rule graph that displays patterns within association rules. The 
second technique is based upon a parallel coordinate visualisation and it displays the 
temporal interval relationships between items. Both of these techniques have been 
successfully used for other data mining applications. Importantly, they are able to 
handle high volumes of data in a way that still allows users to find underlying 
patterns. These two techniques are simple and can be represented in two dimensions 
so that they can be easily reproduced. Research at both DSTO and at Flinders 
University is continuing and we plan to further refine these techniques and to examine 
their scalability to larger datasets. 


Abstract. This paper reports the use of association rules for the discovery of
lithofacies characteristics from well log data. Well log data are used extensively 
in the exploration and evaluation of petroleum reservoirs. Traditionally, 
discriminant analysis, statistical and graphical methods have been used for the 
establishment of well log data interpretation models. Recently, computational 
intelligence techniques such as artificial neural networks and fuzzy logic have 
also been employed. In these techniques, prior knowledge of the log analysts is 
required. This paper investigates the application of association rules to the
problem of knowledge discovery. A case study is used to illustrate the
proposed approach. Based on 96 data points for four lithofacies, twenty
association rules were established and they were further reduced to six explicit
statements. It was found that execution is fast and that the method can be
integrated with other techniques for building intelligent interpretation models.



1 Introduction 



Modern societies rely heavily on hydrocarbon products. It is an ongoing quest of the 
major oil companies to explore new petroleum reservoirs and to determine their 
viabilities of production. The first phase of characterising a petroleum reservoir is to 
carry out well logging of the region under investigation. Boreholes are first drilled and 
logging instruments are then lowered by a wireline to obtain the characteristics from 
the sidewalls. Examples of the measurements include sonic travel time, gamma ray, 
neutron density, and spontaneous potential. This information is collectively known as 
well log or wireline log data. Meanwhile, a limited number of physical rock samples are
extracted at the respective depth intervals for core analysis. They are examined
intensively in the laboratory to obtain the corresponding petrophysical properties such
as porosity, permeability, water saturation, volume of clay, and many other properties. 
Information determined from such analysis is termed core data. 

The two approaches to reservoir characterisation are the “genetic” approach of
lithofacies classification [1] and petrophysical properties prediction [2]. The former
can also be regarded as the identification of the litho hydraulic flow units. A flow unit is
defined as a representative physical rock volume, described in terms of the geological and
petrophysical properties that influence the fluid flow through it [3]. Since there is a
strong correlation between the lithofacies and the corresponding petrophysical 
properties, an identification of the lithofacies will also provide an estimation of the 
petrophysical properties. However, accurate estimation of petrophysical properties 
from wireline logs is difficult to obtain due to the non-linear relationships between the 
two. 

On the other hand, while core data provides the most accurate information about 
the well, core analysis is expensive and time consuming. It is therefore a challenge for 
the log analyst to provide an accurate interpretation model of the petrophysical 
characteristics based on the available wireline log data and the limited core data. 
Traditionally, discriminant analysis, statistical and graphical methods have been used 
for the establishment of well log data interpretation models [4,5]. Recently, 
computational intelligence techniques such as artificial neural networks and fuzzy 
logic have also been employed [6,7,8]. The objective is to develop a description 
model to relate the petrophysical properties to the lithofacies. Such a model is based 
on the available data that embeds the knowledge about the reservoir. In other words, 
the problem is how to construct an accurate description model according to the 
properties found in the underlying data. 

While the task of establishing the interpretation model is not easy, an experienced
log analyst is nevertheless able, based on prior knowledge, to identify the lithofacies
from the wireline log data. Such knowledge will be useful for subsequent
explorations in the region. Since data mining is particularly suitable for the discovery 
of interesting and useful knowledge from a large amount of data [9,10,11,12], this 
approach is therefore appropriate for the problem of well log data analysis. In this 
paper, the application of association rules to knowledge discovery from well log data 
and lithofacies is explored. Such knowledge can be used to enhance the understanding 
of a region under investigation. In the following sections, a brief description of the 
association rules and the procedure of applying the approach are presented. Results 
from a test case based on 96 data points are reported and further research directions 
will also be discussed. 



2 Data Mining and Association Rules 



In recent years, the research areas of Knowledge Discovery in Databases (KDD) and Data
Mining (DM) have contributed much to the understanding of the automatic mining of
information and knowledge from large databases. Their continual evolution has
enabled researchers to realise potential applications in many areas including database
systems, artificial intelligence, machine learning, expert systems, decision support
systems, and data visualisation. While a number of new computational intelligence
techniques such as Artificial Neural Networks (ANN) and Fuzzy Logic (FL) have 
been applied to the problems of lithofacies identification and petrophysical prediction, 
the concepts of applying KDD and DM are new. In the context of this study, the 
objective is to apply appropriate computational techniques to facilitate understanding 
large amounts of data by discovering interesting patterns that exist in the data. The 
technique used in this study is the mining of Association Rules. 

The concept of association rules was introduced as a means of mining a large 
collection of basket data type transactions between sets of items. For example, an 
association rule is expressed in the form of “63% of transactions that purchase engine 
oil also purchase oil filter; 13% of all transactions contain both these items”. The 
antecedent of this rule consists of engine oil and the consequent consists of oil filter. 
The value 63% is the confidence of the rule, and 13% is the support of the rule. In this
study, the items are the well logs, namely Gamma Ray (GR), Deep
Induction Resistive (ILD), and Sonic Travel Time (DT), and the lithofacies to be
associated with them are Mudstone, Sandy Mudstone, Muddy Sandstone, and Sandstone.
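
The two measures can be computed directly from transaction counts. The following Python
sketch is an illustration only: the ten baskets are invented, so the resulting figures
differ from the 63% and 13% quoted above, and the function names are hypothetical.

```python
# Illustrative sketch with invented basket data; all names are hypothetical.

def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent together."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    n_antecedent = count(transactions, antecedent)
    if n_antecedent == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / n_antecedent


baskets = (
    [{"engine oil", "oil filter"}] * 3
    + [{"engine oil"}]
    + [{"wiper blades"}] * 6
)
s = support(baskets, {"engine oil"}, {"oil filter"})
c = confidence(baskets, {"engine oil"}, {"oil filter"})
```

In this toy data three of the ten baskets contain both items, giving a support of 0.3, and
three of the four baskets with engine oil also contain an oil filter, giving a confidence of
0.75.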

Once the association rules are generated, the knowledge discovered from the wireline
log data is expressed as linguistic rules for each lithofacies of interest. By
reorganising the rules according to the lithofacies, the items that appear to be
dominant or distinct in the rules common to each lithofacies are selected. The
process is illustrated in the following case study.



3 Case Study 



A set of 96 data points from two typical boreholes within the same region is used to
demonstrate the use of association rules in discovering knowledge for the
identification of lithofacies. In this study, the data set comprises a suite of three
wireline logs, GR, ILD and DT, and their corresponding lithofacies, which are mudstone
(Class 1), sandy mudstone (Class 2), muddy sandstone (Class 3), and sandstone (Class
4). The numbers of data points in the four classes are 11, 37, 34 and 14 respectively. The
procedure of mining association rules for lithofacies identification follows the one
outlined by Agrawal and Srikant in reference [11]:

Step 1: Determine the number of intervals or interval sizes for each wireline log. 

Step 2: Map the intervals of those wireline logs to consecutive integers such that the 
order of the intervals is preserved and the lithofacies to a set of consecutive 
integers. 

Step 3: Using the algorithm Apriori in [12], find all combinations of itemsets
consisting of well log data and lithofacies whose support is greater than the
specified minimum support, which are called frequent itemsets.

Step 4: If the frequent itemsets satisfy the specified minimum confidence, then
generate the association rules.
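
The four steps can be sketched in Python as follows. This is a minimal illustration under
stated assumptions and not the program used in the study: equal-width intervals are
assumed in Steps 1 and 2, itemsets are enumerated exhaustively rather than with the
level-wise candidate pruning of Apriori [12], the consequent is restricted to the
lithofacies, and the four records are invented. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Illustrative sketch; to_interval and mine_rules are hypothetical names.

def to_interval(value, low, high, n_intervals):
    """Steps 1-2: map a log reading to an integer 0..n_intervals-1 (equal-width bins)."""
    if value >= high:
        return n_intervals - 1
    if value <= low:
        return 0
    return int((value - low) / (high - low) * n_intervals)


def mine_rules(records, min_support, min_confidence):
    """Steps 3-4. records: list of (log_items, lithofacies), log_items like {"GR": 2}.

    Returns (antecedent, lithofacies, support, confidence) tuples.
    """
    n = len(records)
    antecedent_counts = Counter()
    joint_counts = Counter()
    for log_items, litho in records:
        items = sorted(log_items.items())
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                antecedent_counts[combo] += 1
                joint_counts[(combo, litho)] += 1
    rules = []
    for (combo, litho), joint in joint_counts.items():
        support = joint / n
        confidence = joint / antecedent_counts[combo]
        if support > min_support and confidence >= min_confidence:
            rules.append((dict(combo), litho, support, confidence))
    return rules


records = [
    ({"ILD": 0, "DT": 1}, "Mudstone"),
    ({"ILD": 0, "DT": 1}, "Mudstone"),
    ({"ILD": 0, "DT": 1}, "Sandy Mudstone"),
    ({"ILD": 1, "DT": 2}, "Muddy Sandstone"),
]
rules = mine_rules(records, min_support=0.4, min_confidence=0.6)
```

With these invented records only three rules survive the thresholds, all with Mudstone as
the consequent. For larger numbers of logs an implementation of Apriori would avoid
enumerating itemsets that cannot be frequent.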

The study was carried out on a Pentium II 300 MHz PC and the execution time
recorded was 50 milliseconds. A total of 20 association rules were generated. These
rules were formatted and are presented in Table 1.

Table 1. Formatted output of the generated association rules.

Lithofacies (rule consequent)    Number of rules
Mudstone                         3
Sandy Mudstone                   7
Muddy Sandstone                  7
Sandstone                        3

[The antecedent columns of this table are not recoverable from the source.]

From the Table, it can be observed that GR, ILD, and DT are subdivided into 4, 3, and 4
intervals respectively. In order to make the knowledge to be extracted from these rules more
explicit, these intervals are expressed in linguistic terms such as Low (L), Medium (M), High
(H). Therefore, the intervals of GR can be regarded as {L, LM, MH, H}, ILD as {L, M, H}, and
DT as {L, LM, MH, H}. Table 2 shows the rules indicating the responses from each log in
linguistic terms.

Table 2. Expressing the association rules in linguistic terms. 

By examining these rules, distinct wireline log responses are observed.
These responses are then identified as implications of the lithofacies of interest.
Knowledge from these data sets, expressed in explicit statements, is listed in Table 3.



Table 3. Implication of distinct wireline log responses from lithofacies.

Litho  Knowledge expressed in linguistic statements
1      For Mudstone, ILD is Low and DT is likely to be Medium Low
2      For Sandy Mudstone, DT is Medium Low and ILD is likely to be Low
2      For Sandy Mudstone, DT is Medium Low and GR is likely to be Medium High
3      For Muddy Sandstone, DT is Medium High and ILD is either Low or Medium
4      For Sandstone, GR is High and ILD is likely to be Medium
4      For Sandstone, GR is High and DT is likely to be High



4 Discussions 



Based on the above observations, the following points are deduced: 

1. Using the knowledge discovered from Table 2, we can extract heuristic
rules that describe the ranges of the data and that can be used to imply the presence
of lithofacies of interest. Such knowledge can be used by log analysts for
cross-examination or referencing with other wells within the region. For
example, if the response of ILD is Low and DT is Medium-Low, then the
lithofacies of interest is likely to be Mudstone.

2. We also observed that a lithofacies is not necessarily associated with all
three logs. For example, it can be seen from Table 2 that Sandstone is associated
with a high value of GR and a high value of DT. Alternatively, it is also
associated with a high value of GR and a medium value of ILD. This is
consistent with the practice of professional log analysts, who tend to work on a
limited number of logs at any one time.

3. As some logs are more important and distinctive than others, this
knowledge can be used to derive a contribution measure for lithofacies
classification under the genetic approach. Such an approach is very important
for the prediction model if the number of available input logs is very large. This also
forms the basis of a modular approach to lithofacies identification and
petrophysical characteristics prediction.

4. Based on this case study, we can conclude that the proposed approach is fast
and user-friendly compared to other computationally intensive methods such as
artificial neural networks.

The discovered knowledge paves the way for further investigations, which
will include prediction of petrophysical properties and lithofacies classification in
petroleum reservoir characterisation. It is anticipated that the proposed technique will
be integrated with other methodologies such as ANN, FL and expert systems in order
to enhance the intelligence of the well log interpretation model.


5 Conclusion

This paper has reported the use of association rules for mining knowledge from well 
log data and lithofacies characteristics. Well log data analysis is important as this 
forms the basis of decision making in any exploration exercise. By using the available 
log data and lithofacies classes, knowledge in explicit linguistic statements is
obtained and presented to the users. This will assist the decision makers in gaining a
better understanding of the data and information about the region. In the example case
study, the execution time is very short and the extracted knowledge can be integrated
with other prediction techniques in order to build a more intelligent and reliable data 
interpretation model. 


Abstract: This paper reports on the use of a fuzzy rule interpolation technique for the
modelling of hydrocyclones. Hydrocyclones are important equipment used for particle
separation in the mineral processing industry. Fuzzy rule based systems are useful in
application domains where direct control of the hydrocyclone parameters is desired. It has been
reported that a rule extraction technique has been used to extract fuzzy rules from the
input-output data. However, it is not uncommon that the available input-output data set does not
cover the universe of discourse. This results in the generation of sparse fuzzy rule bases. This
paper examines the use of an improved multidimensional fuzzy rule interpolation technique to
enhance the prediction ability of the sparse fuzzy hydrocyclone model. Fuzzy rule interpolation
is normally used to provide interpretations from observations for which there are no overlaps
with the supports of existing rules in the rule base.

1. Introduction 

Mining and mineral processing are two important industries in Australia. The quality 
of the products depends heavily on the precise and efficient refinement and separation 
of the particles according to size and type. One of the most commonly used 
instruments for this purpose is the Hydrocyclone [1]. Hydrocyclones are used to 
classify and separate solids suspended in fluids, commonly known as slurry. The 
particles will leave the hydrocyclone through an underflow opening known as the 
spigot. On the other hand, an upward helical flow containing fine and lighter solid 
particles will exit via the vortex finder at the top, known as the upperflow. For a
hydrocyclone of fixed geometry, the performance of the system depends on a number
of parameters. The separation efficiency of particles of a particular size is determined
by an operational parameter known as d50c. This value indicates that 50% of particles
of a particular size report to each of the upper and underflow streams.

The correct estimation of d50c is important since it is directly related to the efficiency 
of operations and it will also enable control of the hydrocyclone as illustrated by 
Gupta and Eren [2]. Computer control of hydrocyclones can be achieved by 
manipulation of operational parameters such as: diameter of the spigot opening (Du), 
the vortex finder height (H), the inlet flowrate (Qi), the density (Pi) and the 
temperature (T) of slurries for a desired d50c. Traditionally, mathematical models
based on empirical methods and statistical techniques in describing the performance 
of the hydrocyclones are used. Although these approaches have long been established 
in the industry, they have their shortcomings. For example, the experimental 
conditions may vary, resulting in these empirical models being unreliable. Hence, the 
conventional approach may not be universally applicable. In recent years. Artificial 
Neural Network (ANN) [3, 4] and Neural-Fuzzy [5] techniques have been applied. 
Although ANN techniques have proven to be useful for the prediction of the dSOc 
control parameter, the main disadvantage is their inability to convey the acquired 
knowledge to the user. As a trained network is represented by a collection of weights, 
the user will have difficulty in understanding and modifying the model. In many 
cases, the system may not gain the confidence of the user. The Neural-Fuzzy approach 
can be shown to be better than the ANN approach as it can generate fuzzy rules for 
the user to manipulate. However, the fuzzy rules generated to cover the whole sample 
space are too tedious for the user to examine. 

In this paper, a fuzzy hydrocyclone model is proposed. By modifying the on-line 
control system shown in [2], the proposed fuzzy hydrocyclone model is shown in 
Figure 1. As in [2], the d50c is set to a desired value. The signals from the instruments 
are processed to calculate the present value of d50c using the conventional models. To 
minimise the differences between the set value and the present value, the operating 
parameters such as diameter of the spigot opening (Du), the vortex finder height (H), 
and the inlet flowrate (Qi) are changed sequentially until the desired value of d50c is 
obtained. This is significant as the proposed technique allows users to manipulate the 
fuzzy rules easily, which also allows the system to perform in situations where no 
rules are found. 



[Block diagram; legible labels: temperature and density inputs, predicted d50c] 

Figure 1: Online Fuzzy Hydrocyclone Control System 

2. Fuzzy Hydrocyclone Control Model 

Fuzzy control systems have been shown to be useful in dealing with many control problems 
[6]. By far, they are the most important application of the classical fuzzy set theory. 
However, conventional fuzzy systems do not have any learning algorithms to build 
the analysis model. Rather, they are based on human or heuristic knowledge, past 
experience or detailed analysis of the available data in order to build the fuzzy rule 
base for the control system. Therefore, the major limitation is the difficulty in 
building the fuzzy rules. Recently, an automatic self-generating fuzzy rules inference 
system [7] has shown successful results in establishing the well log interpretation 
model. This method is used in this paper to extract fuzzy rules from the test examples 
generated by the hydrocyclone model. 

The steps involved in the self-generating fuzzy rules inference system are summarised 
as follows: 

(1) Determine the universe of discourse for each variable depending on its range 
of value. 

(2) Define the number of fuzzy regions and fuzzy terms for all data. For ease of 
extraction, only triangular types of membership functions are used. 

(3) The space associated with each fuzzy term over the universe of discourse for 
each variable is then calculated and divided evenly. 

(4) For each available test case, a fuzzy rule is established by directly mapping 
the physical value of the variable to the corresponding fuzzy membership 
function. 

(5) Go through Step (4) with all the available test cases and generate one rule for 
each input-output data pair. 

(6) Eliminate repeated fuzzy rules. 

(7) The set of remaining fuzzy rules together with the centroid defuzzification 
algorithm now forms the fuzzy interpretation model. 
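
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of [7]: the function names are hypothetical, every variable is assumed to have a non-zero range, the triangular terms are assumed to be evenly spaced with peaks at the ends of the range, and when two samples give the same antecedent but different consequents the first one is kept (the text only says that repeated rules are eliminated). Defuzzification (Step 7) is left out.

```python
def term_centers(lo, hi, n_terms):
    """Steps 1-3: evenly spaced peaks of the triangular terms over [lo, hi]."""
    step = (hi - lo) / (n_terms - 1)
    return [lo + i * step for i in range(n_terms)]


def strongest_term(value, centers):
    """Step 4: index of the triangular term with the highest membership."""
    width = centers[1] - centers[0]  # assumes hi > lo, so width > 0
    memberships = [max(0.0, 1.0 - abs(value - c) / width) for c in centers]
    return memberships.index(max(memberships))


def generate_rules(samples, n_terms=7):
    """samples: list of tuples (x1, ..., xk, y). Returns (rules, centers).

    rules maps an antecedent (tuple of term indices, one per input)
    to a consequent term index.
    """
    columns = list(zip(*samples))
    centers = [term_centers(min(col), max(col), n_terms) for col in columns]
    rules = {}
    for sample in samples:  # Step 5: one candidate rule per data pair
        terms = tuple(strongest_term(v, c) for v, c in zip(sample, centers))
        # Step 6: a repeated antecedent is not stored again (assumption: first wins)
        rules.setdefault(terms[:-1], terms[-1])
    return rules, centers


samples = [(0, 0, 0), (0, 0, 0), (10, 10, 6), (5, 10, 6)]
rules, centers = generate_rules(samples, n_terms=3)
```

With the four toy samples above, the duplicated first sample yields a single rule, so three rules remain. Any antecedent combination that never occurs in the data has no rule at all, which is the sparse rule base situation discussed next.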

3. Problems of a Sparse Rule Base 

To illustrate the problem of a sparse rule base as described in the previous section, a 
practical case study is presented. Data collected from a Krebs hydrocyclone model 
D6B-12o-839 have been used. A total of 70 training data and 69 testing data are 
used in this study. The input parameters are Qi, Pi, H, Du, and T and the output is 
d50c. The self-generating fuzzy rules technique is used to extract fuzzy rules from the 
70 training data. Seven membership functions per variable were selected, as this gives the best result. 
A total of 64 sparse fuzzy rules are generated by the rule extraction process. 

When this set of sparse rules is used to perform control on the testing data, 4 sets of 
data cannot find any fuzzy rules to fire; these are shown in Figure 2. The output plot of 
the predicted d50c (solid line on the plot) as compared to the observed d50c (dots on 
the plot) is shown in Figure 3. The four zero outputs are the cases where no rule fires. In 
this case study, the number of input sets that cannot find any rule to fire is considered 
minimal. However, in some cases, this may not always be true. If more than half the 
input instances cannot find any rule to fire, this control system may be considered 
useless. This is the major drawback for the fuzzy hydrocyclone control model. The 
problem also exists in most practical cases. 



Warning: no rule is fired for input [493.0 26.00 85.20 3.750 26.00 ]! 0 is used as default output. 

Warning: no rule is fired for input [388.0 24.20 69.50 2.650 33.00 ]! 0 is used as default output. 

Warning: no rule is fired for input [462.0 10.20 85.20 3.750 40.00 ]! 0 is used as default output. 

Warning: no rule is fired for input [267.00 24.50 85.20 3.750 34.00 ]! 0 is used as default output. 

Figure 2: Warning message for input without firing rules. 
Figure 3: Output plot showing test case results indicating the rules fired. 

4. Fuzzy Rule Interpolation 

In the case when a rule base contains gaps or is a sparse rule base, classical fuzzy 
reasoning methods can no longer be used. This is the problem highlighted in the 
previous section, as an observation finds no rule to fire. Fuzzy rule interpolation 
techniques provide a tool for specifying an output fuzzy set whenever at least one of 
the input universes is sparse. Koczy and Hirota [8] introduced the first interpolation 
approach known as (linear) KH interpolation. 

Two conditions must hold for linear interpolation to be used. Firstly, there should 
exist an ordering on the input and output universes. This allows us to introduce a 
notion of distance between the fuzzy sets. Secondly, the fuzzy sets involved (antecedents, 
consequents and the observation) should be convex and normal fuzzy (CNF) sets. The 
method determines the conclusion by its α-cuts in such a way that the ratio of the 
distances between the conclusion and the consequents is identical to the ratio of the 
distances between the observation and the antecedents for all important α-cuts (breakpoint 
levels). 
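
As an illustration of the distance-ratio condition, the following Python sketch applies linear KH interpolation to one-dimensional triangular fuzzy sets. It is a minimal sketch under stated assumptions, not the improved multidimensional method of [11]: each set is written as a hypothetical (left, peak, right) triple, the only breakpoint levels are α = 0 and α = 1, and the observation is assumed to lie between two antecedents with distinct end points. Keeping the distance ratios equal at each end point is the same as interpolating that end point linearly.

```python
def kh_interpolate(a1, b1, a2, b2, obs):
    """Linear KH interpolation between rules a1 -> b1 and a2 -> b2.

    All fuzzy sets are triangular triples (left, peak, right).
    Returns the conclusion triple and a flag telling whether it is
    a valid triangular set (left <= peak <= right).
    """
    def lerp(x1, y1, x2, y2, x):
        # Ratio of distances on the input side is reused on the output side.
        lam = (x - x1) / (x2 - x1)
        return (1.0 - lam) * y1 + lam * y2

    left = lerp(a1[0], b1[0], a2[0], b2[0], obs[0])   # lower end of the alpha = 0 cut
    peak = lerp(a1[1], b1[1], a2[1], b2[1], obs[1])   # alpha = 1 cut
    right = lerp(a1[2], b1[2], a2[2], b2[2], obs[2])  # upper end of the alpha = 0 cut
    valid = left <= peak <= right
    return (left, peak, right), valid


conclusion, valid = kh_interpolate(
    (0, 1, 2), (0, 1, 2),      # rule 1: antecedent, consequent
    (8, 9, 10), (4, 5, 6),     # rule 2: antecedent, consequent
    (4, 5, 6),                 # observation halfway between the antecedents
)
```

Here the observation lies halfway between the two antecedents, so the conclusion lies halfway between the two consequents. The `valid` flag is included because the end points are interpolated independently, so nothing guarantees that they stay in order.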

The KH interpolation possesses several advantageous properties. Firstly, it behaves 
approximately linearly in between the breakpoint levels. Secondly, its computational 
complexity is low, as it is sufficient to calculate the conclusion for the breakpoint 
level set. However, for some input situations, it fails to result in a directly 
interpretable fuzzy set, because the slopes of the conclusion can collapse [9]. To 
address this problem, improved fuzzy rule interpolation techniques [9,10] have been 
developed. While most fuzzy interpolation techniques perform analysis on one- 
dimensional input space, the improved multidimensional fuzzy interpolation 
technique [11] proposed in this paper will handle multidimensional input spaces. This 
has been applied for the development of the fuzzy hydrocyclone control model. 

5. Case Study and Discussions 

The test case described in this paper incorporates the improved multidimensional 
fuzzy interpolation method for the development of an accurate hydrocyclone model. 
As mentioned before, there are a total of four input instances that cannot find any 
firing rules (refer to Figure 2). From the observation and Euclidean distance measured 
on each input variable, the nearest fuzzy rules of the four input instances are 
determined for use by fuzzy interpolation. 
A comparison of the results generated by the previous fuzzy hydrocyclone 
control model and by the same fuzzy model with the improved multidimensional fuzzy 
interpolation technique is shown in Table 1. In order to show the applicability of the 
proposed fuzzy hydrocyclone model, the results are also compared with those 
generated from the on-line control model shown in [2]. The graphical plots of the 
results generated from the model with the improved multidimensional fuzzy 
interpolation technique are shown in Figure 4. 

A few measurements of differences between the predicted d50c (T) and observed 
d50c (O) over the n test cases are used. They are: Euclidean Distance 
$ED = \sqrt{\sum_{i=1}^{n} (T_i - O_i)^2}$; Mean Character Difference Distance 
$MCD = \frac{1}{n} \sum_{i=1}^{n} |T_i - O_i|$; Percent Similarity Coefficient 
$PSC = 200 \, \frac{\sum_{i=1}^{n} \min(T_i, O_i)}{\sum_{i=1}^{n} (T_i + O_i)}$. 
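
The three difference measures can be computed as in the short Python sketch below. The function names are hypothetical, and the formulas follow the usual definitions of these measures: ED is the square root of the summed squared differences, MCD is the mean absolute difference, and PSC is 200 times the sum of the pairwise minima divided by the sum of all predicted and observed values.

```python
import math


def euclidean_distance(pred, obs):
    """ED: square root of the sum of squared differences."""
    return math.sqrt(sum((t - o) ** 2 for t, o in zip(pred, obs)))


def mean_character_difference(pred, obs):
    """MCD: mean absolute difference."""
    return sum(abs(t - o) for t, o in zip(pred, obs)) / len(pred)


def percent_similarity(pred, obs):
    """PSC: 200 * sum of pairwise minima / sum of all values (100 = identical)."""
    numerator = sum(min(t, o) for t, o in zip(pred, obs))
    denominator = sum(t + o for t, o in zip(pred, obs))
    return 200.0 * numerator / denominator


# Toy values only, not data from the study.
ed = euclidean_distance([3.0, 0.0], [0.0, 4.0])
mcd = mean_character_difference([3.0, 0.0], [0.0, 4.0])
psc_identical = percent_similarity([10.0, 20.0], [10.0, 20.0])
```

A smaller ED or MCD and a larger PSC indicate closer agreement. A predicted value of zero, as produced when no rule fires, contributes the full observed value to ED and MCD and nothing to the PSC numerator.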



Table 1: Comparisons of results 

Model Type                          ED        MCD     PSC 
Formula from [2]                    59.787    5.179   90.096 
Fuzzy (no fuzzy interpolation)      101.640   6.409   88.211 
Fuzzy (with fuzzy interpolation)    52.97     4.595   91.889 




Figure 5: Output plot showing test case results with fuzzy interpolation. 
From Table 1, the results show that the fuzzy hydrocyclone model performs 
poorly when no interpolation technique is used. This is mainly due to the four 
input instances that find no rule to fire and generate a default value of zero. With the 
fuzzy rule interpolation technique, the number of fuzzy rules is not increased, but the 
prediction ability has improved. This is a desirable characteristic for on-line 
hydrocyclone control, as an increase in the number of fuzzy rules would result in an 
increase in complexity, which would make the examination of the fuzzy rule base 
more difficult. 

6. Conclusion 

In this paper, the practical applicability of the self-generating fuzzy rule inference 
system in hydrocyclone control has been examined. The problem of sparse rule bases 
and insufficient input data may cause undesirable control actions. This is mainly due 
to input instances that could not find any rule in the fuzzy rule base. To provide a 
solution to this problem, the improved multidimensional fuzzy rule interpolation 
method has been applied. This method can be used to interpolate the gaps between the 
rules. This ensures that the set of sparse fuzzy rules generated by the self-generating 
fuzzy rule inference system will be useable in a practical system. 
Abstract: In this paper, a data-driven fuzzy approach is developed for solving 
the motion planning problem of a mobile robot in the presence of moving 
obstacles. The approach consists of using a recent data-driven fuzzy controller 
modeling algorithm, and a devised general method for the derivation of input-output 
data to construct a fuzzy logic controller (FLC) off-line. The constructed FLC 
can then be used on-line by the robot to navigate among moving obstacles. The 
novelty in the presented approach, as compared to the most recent fuzzy ones, 
stems from its generality. That is, the devised data-derivation method enables the 
construction of a single FLC to accommodate a wide range of scenarios. Also, 
care has been taken to find optimal or near optimal FLC solution in the sense of 
leading to a sufficiently small robot travel time and collision-free path between 
the start and target points. 

1 Introduction 

In dealing with the motion-planning problem of a mobile robot among existing 
obstacles, different classical approaches have been developed. Among these 
approaches, we note path velocity decomposition [1], [2], incremental planning 
[3], the relative velocity paradigm [4], and the potential field [5]. Soft-computing techniques, 
employing various learning methods, have also been used to improve the 
performance of conventional controllers [6], [7]. Each of the above noted methods is 
either computationally intensive or capable of solving only a particular type of 
problem, or both. 

In order to reduce the computational burden and provide a more natural solution 
for the dynamic motion-planning (DMP) problem, fuzzy approaches, with emphasis 
on user-defined rules and collision-free paths, have been suggested [8]-[10]. 
Recently, a more advanced fuzzy-genetic-algorithm approach has been devised [11]. 
The emphasis has been not only on obtaining collision-free paths, but also on the 
optimization (minimization) of travel time (or path) between the start and target 
points of the robot. Genetic algorithms have, therefore, been used to come up with an 
optimal or near optimal fuzzy rule-base off-line by employing a number of user- 
defined scenarios. Although the noted fuzzy-genetic approach provided good testing 
results on scenarios some of which were used in training and others were not, it had 
its limitations. A different set of rules needed to be determined for every specific 
number of moving obstacles. 

The approach presented in this study considers the off-line derivation of a general 
fuzzy rule base; that is a base that can be used on-line by the robot independently of 
the number of moving obstacles. This is achieved using a recently developed data-driven 
learning algorithm for the modeling of fuzzy logic controllers (FLCs) [12], 
and by devising a method for the derivation of the training data based on the 
general setting of the DMP problem and not on specific scenarios. Furthermore, 
collision-free paths and reduction of travel time are still within the goals considered 
in the derivation of the FLC. 

2 Problem Definition, Constraints and Data Derivation 

In the DMP problem, a robot needs to move from a start point S to a target point 
G located in some quadrant where the moving obstacles exist. The purpose is to find 
an obstacle-free path which takes the robot from S to G with minimum time. A fuzzy 
logic controller represented by a set of inference rules is to be constructed such that 
when it is used by the robot and supplied with information about the moving 
obstacles, it provides decisions that enable the achievement of the stated objective. 
But what kind of information needs to be supplied to the FLC, and what kind of 
decisions does it need to provide? 

Let us first consider the incremental approach related to the motion of the robot 
and also considered in [11]. The robot, therefore, moves from one point to another in 
accordance with time steps, each of duration ΔT, and at the end of each step it needs 
to decide on the movement direction. Due to the problem objective, once the robot is 
at some point it needs to consider moving in a straight line towards the target point 
unless the information collected about the moving obstacles tells otherwise due to a 
possible collision. Hence, the information that needs to be obtained has to relate, in 
principle, to the position of each obstacle and its velocity relative to the robot 
position; i.e., the obstacle velocity vector. But, since the robot knows the position of 
each obstacle at every time step, an alternative to the use of the relative velocity can 
be the predicted position of each obstacle. This can be computed based on the 
obstacle's present and previous positions. $P_{predicted}$ is assumed to be the linearly extrapolated 
position of each obstacle from its present position $P_{present}$ along the line formed by 
joining $P_{present}$ and $P_{previous}$ (see [11]). Thus, 

$$P_{predicted} = P_{present} + (P_{present} - P_{previous})$$ 
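
As a small worked example of this linear extrapolation, the Python sketch below predicts the next position of an obstacle from its present and previous positions. The function name and the use of plain (x, y) tuples are assumptions made for illustration only.

```python
def predict_position(present, previous):
    """Extrapolate one time step ahead: present + (present - previous)."""
    return tuple(2.0 * c - p for c, p in zip(present, previous))


# An obstacle that moved from (2, 2) to (3, 4) during the last time step.
predicted = predict_position((3.0, 4.0), (2.0, 2.0))
```

The obstacle moved by (1, 2) in the last step, so it is predicted to move by the same amount again. This only holds when the obstacle keeps a constant speed and direction over the time steps involved.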

But, to process all this information by the robot controller is difficult. The 
procedure that can be applied here, and which leads to a simplification of the 
controller structure, consists of using the collected information to determine the 
“nearest obstacle forward” (NOF) to the robot [11]. Then, only the information 
related to this obstacle is used by the FLC to provide decisions. The NOF is the 
obstacle located in front of the robot and with velocity vector pointing towards the 
line joining the robot position to the target point. It thus constitutes the 
greatest collision danger relative to the other obstacles if the robot chooses 
to move straight to the target (Fig. 1). The NOF can equivalently be identified using 
the present and predicted positions of each obstacle. 

Therefore, what needs to be used are the present and predicted positions of the 
NOF. The position has two components; angle and distance. The angle is the one 
between the line joining the target to the robot position and the line between the robot 
and the NOF. The distance is the one between the robot and the NOF. The FLC 
output is the deviation angle between the target-point-robot line and the new 
direction of robot movement (Fig. 1). Based on the noted information the robot will 
be able to know whether the NOF will get close to or cross the line segment joining 
the present position of the robot and the point it reaches after ΔT time if it moves 
straight to the target. This knowledge is in fact necessary for the determination of the 
angle of deviation. 
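
The two position components described above can be computed as in the following Python sketch. The function name, the (x, y) tuple representation and the sign convention (counter-clockwise angles positive, angles wrapped to the interval from -180 up to but not including 180 degrees) are assumptions made for illustration; the text does not fix them.

```python
import math


def nof_inputs(robot, target, nof):
    """Return (distance, angle_deg) of the NOF as seen from the robot.

    distance: robot-to-NOF distance.
    angle_deg: signed angle from the robot-target line to the robot-NOF line.
    """
    to_target = math.atan2(target[1] - robot[1], target[0] - robot[0])
    to_nof = math.atan2(nof[1] - robot[1], nof[0] - robot[0])
    angle = math.degrees(to_nof - to_target)
    angle = (angle + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)
    distance = math.hypot(nof[0] - robot[0], nof[1] - robot[1])
    return distance, angle


# Robot at the origin, target straight ahead on the x axis, NOF off to one side.
distance, angle = nof_inputs((0.0, 0.0), (10.0, 0.0), (0.0, 2.0))
```

Calling the same function with the predicted position of the NOF instead of its present position gives the predicted distance and angle.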



[Figure 1 labels: G; O = NOF; angle GRO = angle; RO = distance; angle GRD = deviation] 

Figure 1. Illustration of NOF and angle, distance and deviation. 

But, to include all these variables in the conditions of the FLC will complicate its 
structure. It will also make the derivation of the input-output data points needed for 
the construction of the inference rules a difficult task. To make things simpler while 
maintaining the practicality of the problem, a constraint (constraint 4 below), which 
is not too restrictive, is considered in addition to other ones implied by the 
aforementioned problem description and adopted in [11]. 

1. The robot is considered to be a single point. 

2. Each obstacle is represented by its bounding circle. 

3. The speed of each obstacle is constant with a fixed direction between its 
previous, present and predicted positions. 

4. The distance traveled by the NOF in ΔT time is comparable to its diameter. 

Of course, constraint 3 presupposes that the obstacles do not collide while moving. 
Also, constraint 4, with the problem configuration as depicted in Figure 2 and its use 
in the determination of the input-output data (see below), will reduce the number of 
FLC input variables to 2: predicted angle and distance. The present position of the 
NOF is still accounted for but not used explicitly in the controller conditions. 

Figure 2 considers a quadrant filled by side-to-side obstacles each of which may 
constitute the predicted position of the NOF. Suppose that the robot is in position R 
(present position) and the NOF predicted position is in (A11, B13). The robot's initial 
intention is to move straight to G if collision is deemed impossible. Otherwise, an 
angle of deviation needs to be determined. Due to constraint 4, the present position of 
the NOF could be any of the neighboring obstacles such that the distance between the 
center of each of these obstacles and the center of (A11, B13) is approximately equal 
to the obstacle diameter. The neighboring obstacles are roughly represented by (A10, 
B12), (A11, B12), (A12, B12), (A10, B13), (A12, B13), (A10, B14), (A11, B14), (A12, B14). Of 
course, if the segment between the present position of the robot and the point it 
reaches after ΔT time penetrates the square formed by the outer tangent lines to the 
noted 8 obstacles, a deviation from the straight line between the robot and target 
point is required. Otherwise, no deviation is necessary. The amount of deviation is to 
be specified based on having the robot move in a direction that is just sufficient to 
avoid hitting not only the predicted obstacle position, but also any of the 8 shown 
present positions of the obstacles and all obstacles in between. This is because the 
NOF might be slow compared to the robot speed and thus a collision danger might 
exist. Among the two directions RR1 and RR2, which lead to the avoidance of the 
obstacle positions, the one with the smallest deviation angle, i.e., RR1, is chosen. 
This serves the travel time reduction objective. 




Figure 2. A general configuration of the DMP problem used in the data derivation. 

We also note here that the obstacle diameter should not be very large. Then, if a 
deviation is decided on although the actual and predicted obstacle positions do not 
demand it (suppose, for example, that the present position of (A11, B13) is 
(A10, B12)), the robot will not have moved far away from the direct path and 
the trajectory length will not increase significantly. This also serves the travel time 
minimization objective of the DMP problem. 

Now, based on the problem configuration in Figure 2 and the described general 
approach for the determination of the necessary deviation for every possible pair of 
predicted distance and angle of the NOF, various locations of the NOF within the 
noted quadrant were considered and accordingly input-output data were derived. The 
locations of the NOF were selected so that the input pairs cover the input space 
adequately. This is necessary for the construction of the FLC using the data-driven 
algorithm in [12]. 

It needs to be mentioned here that attempts were made to derive the data pairs by 
considering scenarios each containing a specific number of obstacles [11]. This 
approach was found difficult to use since an adequate coverage of the input 
space was not within reach. The derived data points are shown in Table 1. These are 
obtained based on an obstacle diameter equal to 0.5 meters and a robot traveled distance 
in ΔT time equal to 2 meters. 

3 FLC Construction 

The data points in Table 1 were used in the learning algorithm introduced in [12] and 
a set of inference rules (Table 2) was obtained using the input and output 
membership functions (MF’s) shown in Figure 3. Actually, the algorithm, which 
relies on the use of a parametrized defuzzification strategy [13], operates based on a 
consistent modification of the defuzzification parameter and initial rule consequents 
to reduce the data approximation error and obtain a final fuzzy system. The initial 
rule consequents are required to be equal to the left-most fuzzy output (i.e., VI in 
Figure 3) and the membership functions assigned over the controller variables need to 
cover the ranges of these variables. The ranges of the distance, angle and deviation as 
considered are from 0 to 17 meters, -180 to 180 degrees and -90 to 90 degrees 
respectively. These ranges are considered to account for all possible values. A 
modification of the number of MF’s, their shapes and the density of their coverage 
of the FLC variables can be done by the designer to lower the data approximation 
error. The distribution of the input MF’s in Fig. 3, which is reasonable since most 
robot deviations occur at small angles and distances, was verified by the learning 
tests to serve the error reduction goal. 

Table 1. Input-output data pairs obtained using the method described in Section 2. 

Distance  Angle  Deviation | Distance  Angle  Deviation | Distance  Angle  Deviation | Distance  Angle  Deviation 
0.25      -24    80        | 0.8       11     -33       | 1.45      -29    0         | 2         70     0 
0.25      -12    70        | 0.8       22     -22       | 1.45      33     0         | 2.05      -12    7 
0.25      30     -65       | 0.8       75     0         | 1.5       5      -18       | 2.05      16     -5 
0.25      43     -80       | 0.8       107    0         | 1.6       -80    0         | 2.2       -25    0 
0.3       -41    70        | 1         -43    0         | 1.6       -15    10        | 2.25      9      -10 
0.35      50     -35       | 1         4      -30       | 1.6       22     -3        | 2.35      -21    0 
0.35      139    0         | 1         90     0         | 1.6       80     0         | 2.5       -10    0 
0.4       49     0         | 1.1       -25    10        | 1.75      -6     15        | 2.5       3      0 
0.45      -90    0         | 1.1       30     -3        | 1.75      11     -7        | 2.55      -30    0 
0.5       -90    0         | 1.1       48     0         | 1.8       -32    0         | 2.8       -4     0 
0.5       -20    55        | 1.2       65     0         | 1.8       37     0         | 3         -70    0 
0.5       -5     70        | 1.25      -9     18        | 1.9       -22    0         | 3         -43    0 
0.5       5      -70       | 1.3       -90    0         | 1.9       27     0         | 3         70     0 
0.5       20     -55       | 1.3       15     -10       | 2         -70    0         | 3.2       13     0 
0.7       -43    12        | 1.3       80     0         | 2         3      -16       | 3.25      16     0 
0.8       -17    30        | 1.4       48     0         | 2         30     0         | 5         -43    0 


















Figure 3. Input and output MF’s used in learning. 



Table 2. Final fuzzy system obtained by learning. 
4 Testing Results 

In this section, the obtained FLC is tested on various scenarios containing 
different numbers of obstacles. One case with 3 obstacles, one with 5, and two with 8 obstacles are 
considered (Figure 4). In all the cases, the robot travels from point S to point G 
without hitting any of the obstacles. Also, the traveled paths are optimal in the sense 
that the deviations which took place at the end of every time step are in most cases 
just as necessary in order for the robot to remain as close as possible to the robot- 
destination direct path while not colliding with the obstacles [11]. Moreover, two of 
these scenarios (Figures 4(a) and 4(d)) were presented in [11] and had obstacles with 
distinct diameters. Some had diameters close to the one considered in this study, and 
others with larger diameters. Despite this, the robot path chosen by the constructed 
FLC does not hit any of the obstacles. This shows that the constructed FLC can work 
properly for obstacles whose diameter values differ from the one used in the data 
derivation. Of course, a significant increase in the diameters would make the chances 
of having the robot hitting the obstacles higher. The clarification of this fact can be 
obtained by referring back to the described data derivation approach. Also, the case 
in Figure 4(a) shows the robot passing tangentially by the current position of one of 
the obstacles. A hit might have occurred had the diameter been larger. Thus, a set of 
data points different from that in Table 1 needs to be determined for different 
obstacle diameters. The same applies for the distance traveled by the robot in ΔT 
time. But the presented data derivation approach is general and can still be used. 







Figure 4. Paths traveled by the robot in 4 scenarios: (a) 3 obstacles, (b) 5 obstacles 
and (c) and (d) have 8 obstacles each. 

Table 3 shows the distance ratio (traveled distance/ direct distance) using the 
presented data-driven fuzzy approach and the fuzzy-genetic one. The ratio in our 
approach is a bit higher than that obtained in [11]. Thus, a slightly longer time is 
required for the robot to reach destination. 

Table 3. Traveled distance and ratios for the presented approach and the fuzzy-genetic one. 



OBSTACLES              3        5        8        8 
DIRECT DISTANCE        14       14       20       20 
TRAVELED DISTANCE      15.25    15.35    20.9     21.7 
RATIO (OUR APPROACH)   1.089    1.096    1.045    1.085 
RATIO (GENETIC)        1.046    -        -        1.05 



5 Conclusions 

A data-driven fuzzy approach has been developed in this study to provide a 
general framework for solving the DMP problem under some constraints. It consisted 
of providing a general and well designed method for the derivation of input-output 
data to construct a single FLC that can be used by the robot to guide its navigation in 
the presence of moving obstacles and independently of the number of these obstacles. 
This is the main advantage of the presented approach over the most recent 
fuzzy-genetic one. In a general robot environment, it is difficult to guess the number of 
obstacles that the robot could face while navigating. From this perspective, the 
results in Table 3 are quite acceptable. As compared to other fuzzy approaches 
[8]-[10], the presented approach is systematic and does not employ user-defined rules, 
which are mostly derived by trial and error. The devised method has also accounted 
for collision-free paths and reduction of travel time while lessening the number of 
FLC variables and hence simplifying its structure. A recently developed data-driven fuzzy 
controller modeling algorithm has been used to construct the FLC. 
Abstract. Bayesian Ying-Yang (BYY) learning was first proposed as a unified 
statistical learning framework in (Xu, 1995) and has been systematically 
developed in past years. It consists of a general BYY system and a 
fundamental harmony learning principle as a unified guide for developing new 
parameter learning algorithms, new regularization techniques, new model 
selection criteria, as well as a new learning approach that implements 
parameter learning with model selection made automatically during 
learning (Xu, 1999a&b; 2000a&b). This paper goes beyond the scope 
of BYY learning, and provides new results and new understandings on 
harmony learning from perspectives of conventional parametric models, 
BYY systems and some general properties of information geometry. 



1 Introduction 

Specifically, BYY learning with specific structure designs leads to three major 
paradigms. First, the BYY unsupervised learning provides a number of new 
results on several existing major unsupervised learning methods [1-5]; the details 
are partly given in [15,14,10] and a review is given in [10]. Second, the BYY 
supervised learning provides not only new understanding on three major 
supervised learning models, namely the three layer forward net with back-propagation 
learning, the Mixture Expert (ME) model [6,7] and its alternative model, as well as 
normalized radial basis function (RBF) nets and their extensions [13], but also new 
adaptive learning algorithms and new criteria for deciding the number of 
hidden units, of experts and of basis functions, with the details referred to [10-13]. 
Moreover, the temporal BYY learning acts as a general state space approach 
for modeling data that has temporal relationship among samples, which 
provides not only a unified point of view on the Kalman filter, the Hidden Markov model 
(HMM), ICA and blind source separation (BSS) with extensions, but also 
several new results such as higher order HMM, independent HMM for binary BSS, 
temporal ICA and temporal factor analysis for noisy real BSS, with adaptive 
algorithms for implementation and criteria for selecting the number of states or 
sources [9]. 

In this paper, we go beyond the scope of BYY learning and study the harmony 
learning principle from perspectives of conventional parametric models, BYY 
systems and information geometry. 

* The work described in this paper was fully supported by a grant from the Research 
Grant Council of the Hong Kong SAR (project No: CUHK4383/99E). 
2 Best Harmony Learning Principle 

Given two models p(u) = p(u\0p, kp) , q[u) = q[u\0g, kg), with 0 = 0p,0g con- 
sisting of all the unknown parameters in each of the models, and k — {kp,kg] 
consisting of integers that indicate structural scales of the models, onr funda- 
mental learning principle for estimating the parameter 0 and selecting the scale 
k is the best harmony in a twofold sense: 

- The difference between the obtained p(u), q(u) should be minimized. 

- The obtained p(u), q(u) should be in the least in their complexity. 

Mathematically, we use a functional to measure the degree of harmony between
p(u) and g(u). When both p(u), g(u) are point densities in the form

$p(u) = \sum_t p_t\, \delta(u - u_t)$, $g(u) = \sum_t g_t\, \delta(u - u_t)$, $\sum_t p_t = \sum_t g_t = 1$, (1)

such a measure is simply given as follows:

$H(p\|g) = \sum_t p_t \ln g_t$. (2)

It can be observed that when p = g we have $H(p\|p)$, which is the negative
entropy of p. More interestingly, the maximization of $H(p\|g)$ will not only push
p, g towards $p_t = g_t$ but also push p(u) towards the simplest form

$p(u) = \delta(u - u_\tau)$, with $\tau = \arg\max_t g_t$, (3)

or equivalently $p_\tau = 1$ and $p_t = 0$ for other t, which is of least complexity from
the perspective of statistical theory. Thus, the maximization of the functional
indeed implements the above harmony purpose mathematically.
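
The twofold effect of maximizing the discrete measure of eq.(2) can be checked numerically. The following is only a minimal sketch: the function names and the three-point example density are illustrative assumptions, not part of the original formulation.

```python
import math

def harmony(p, g):
    """H(p||g) = sum_t p_t ln g_t for point densities; terms with p_t = 0 are skipped."""
    return sum(pt * math.log(gt) for pt, gt in zip(p, g) if pt > 0.0)

def one_hot_at_argmax(g):
    """Point density with all mass on the index that maximizes g_t, as in eq.(3)."""
    tau = max(range(len(g)), key=lambda t: g[t])
    return [1.0 if t == tau else 0.0 for t in range(len(g))]

g = [0.1, 0.6, 0.3]                 # hypothetical example density
p_match = list(g)                   # p = g: matching only
p_simple = one_hot_at_argmax(g)     # p = delta at argmax g: matching plus least complexity

h_match = harmony(p_match, g)       # equals the negative entropy of g
h_simple = harmony(p_simple, g)     # equals ln max_t g_t
```

With g fixed, the one-hot p reaches a harmony value at least as large as p = g, while with p fixed the best g is p itself; this is the sense in which the maximization pushes both towards matching and towards simplicity.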

When g(u) is a continuous density, we approximate it by either simply its
sampling points $g(u_t)$ or its normalized version $g_t = g(u_t)/\sum_t g(u_t)$. Furthermore,
when the set $U$ of sampling points has a large enough size N, with samples
uniformly covering the entire space of u, we can approximately regard $g(u_t)$ as the
value of the corresponding histogram density at $u_t$ within a hyper-cubic bin of
volume $h^d$, where $h > 0$ is a very small constant and $d$ is the dimension of u.
Then, we have

$\sum_t g(u_t)\, h^d \approx 1$. (4)



Including the discrete case, we have a general form

$g_t = g(u_t)/z_g$, with
  (a) $z_g = 1$, when g(u) is a point density,
  (b) $z_g = 1$, when g(u) is pointized by its sampling points,
  (c) $z_g = \sum_t g(u_t)$, when g(u) is pointized and normalized,
  (d) $z_g = \lim_{du \to 0} 1/\nu(du)$, when g(u) is approximated via a histogram, (5)

where $\nu(\cdot)$ is a given measure on u. Particularly, for the Lebesgue measure,
$\nu(du)$ is the volume of du.

Putting eq.(5) into eq.(2), we have

$H(p\|g) = \sum_t p_t \ln g(u_t) - \ln z_g$. (6)

In other words, given $U$ we can approximately use eq.(6) in place of eq.(2) for
implementing the best harmony learning.

We can also approximate a continuous p(u) in the form eq.(5) and get
$\sum_t p(u_t) \ln g(u_t)/z_p$. Then, similar to eq.(4) we have
$\sum_t p(u_t) \ln g(u_t)/z_p \approx \int p(u) \ln g(u)\, \nu(du)$, which is actually also true when
p(u) is a discrete density. Thus, a generalized form of the harmony measure is
given as follows:

$H(p\|g) = \int p(u) \ln g(u)\, \nu(du) - \ln z_g$, (7)

with $z_g$ given in eq.(5). Therefore, we implement harmony learning by

$\max_{\theta, k} H(\theta, k)$, $H(\theta, k) = H(p\|g)$. (8)



3 Parameter Learning: ML, Posteriori WTA and Two
Regularization Methods



Provided that k is prefixed, we consider typical cases for determining $\theta$, also
called parameter learning, on both the simple parametric model and the
comprehensive BYY system.

Parameter learning on simple parametric models Given a data set
$U = \{u_t\}_{t=1}^{N}$ and a parametric model $g(u|\theta)$, the task now is to specify the value
of $\theta$ such that $g(u|\theta)$ well fits $U$.

Learning from the set $U$ is equivalent to learning from its empirical density:

$p(u) = p_0(u)$, $p_0(u) = \frac{1}{N} \sum_{t=1}^{N} \delta(u - u_t)$,
$\delta(u) = \lim_{du \to 0} 1/\nu(du)$ for $u = 0$, and $\delta(u) = 0$ for $u \neq 0$, (9)

where $\nu(\cdot)$ is the same as in eq.(5). Putting this p(u) and $g(u|\theta)$ into eq.(7), for the
cases (a)&(b) in eq.(5) as well as case (d) at fixed h, d, $\max_\theta H(p\|g)$ becomes
equivalent to

$\max_\theta L(\theta)$, $L(\theta) = \frac{1}{N} \sum_{t=1}^{N} \ln g(u_t|\theta)$, (10)

which is exactly the Maximum Likelihood (ML) learning on $g(u|\theta)$.

While for the case (c), $\max_\theta H(p\|g)$ becomes

$\max_\theta L_R(\theta)$, $L_R(\theta) = L(\theta) - \ln[\sum_t g(u_t|\theta)]$, (11)

which consists of ML learning plus a regularization that prevents $g(u|\theta)$ from
overfitting a finite size data set $U$. This point can be better observed by comparing
the gradients:

$\nabla_\theta L(\theta) = Gd(\gamma_t)|_{\gamma_t = 1/N}$, $\nabla_\theta L_R(\theta) = Gd(\gamma_t)|_{\gamma_t = 1/N - \hat{g}(u_t|\theta)}$,
$Gd(\gamma_t) = \sum_t \gamma_t \nabla_\theta \ln g(u_t|\theta)$, $\hat{g}(u_t|\theta) = g(u_t|\theta)/\sum_\tau g(u_\tau|\theta)$. (12)

It follows from $\nabla_\theta L_R(\theta)$ that a de-learning is introduced to ML learning for
each sample in proportion to the current fitting of the model to the sample.
We call this new regularization method normalized point density.
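
The de-learning effect can be illustrated on a toy model. The sketch below assumes a one-dimensional Gaussian $g(u|\theta)$ with unit variance and unknown mean $\theta$ (an illustrative choice, not taken from the paper), so that $\nabla_\theta \ln g(u_t|\theta) = u_t - \theta$. It evaluates the weighted gradient with the ML weights $1/N$ and with the weights $1/N - \hat{g}(u_t|\theta)$ of the normalized point density case.

```python
import math

def gauss(u, theta):
    """Unit-variance 1-D Gaussian density g(u|theta) (illustrative model)."""
    return math.exp(-0.5 * (u - theta) ** 2) / math.sqrt(2.0 * math.pi)

def L(data, theta):
    """Average log-likelihood."""
    return sum(math.log(gauss(u, theta)) for u in data) / len(data)

def L_R(data, theta):
    """Log-likelihood minus the log of the summed model values at the samples."""
    return L(data, theta) - math.log(sum(gauss(u, theta) for u in data))

def grad_weighted(data, theta, gammas):
    """sum_t gamma_t * d/dtheta ln g(u_t|theta); here d/dtheta ln g = (u_t - theta)."""
    return sum(gm * (u - theta) for gm, u in zip(gammas, data))

def grad_L(data, theta):
    n = len(data)
    return grad_weighted(data, theta, [1.0 / n] * n)

def grad_L_R(data, theta):
    n = len(data)
    z = sum(gauss(u, theta) for u in data)
    # each sample is "de-learned" in proportion to how well the model already fits it
    gammas = [1.0 / n - gauss(u, theta) / z for u in data]
    return grad_weighted(data, theta, gammas)
```

Samples that the model already fits well receive a smaller, possibly negative, weight, which is what keeps the model from collapsing onto a few points.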
Another new regularization method can be obtained from the case (d) in
eq.(5) with p(u) not given by eq.(9) but by the Parzen window estimate:

$p(u) = p_h(u)$, $p_h(u) = \frac{1}{N} \sum_{t=1}^{N} G(u|u_t, \sigma^2 I_d)$, (13)

where $G(u|m, \Sigma)$ denotes a gaussian of mean m and covariance matrix $\Sigma$, $I_d$ is
the $d \times d$ identity matrix, and d is the dimension of u. As $\sigma \to 0$,
$G(u|u_t, \sigma^2 I_d) \to \delta(u - u_t)$, and $p_h(u)$ returns to $p_0(u)$. Generally, $p_h(u)$ is a
smoothed modification of $p_0(u)$ by using a kernel $G(u|0, \sigma^2 I_d)$ to blur out each
impulse at $u_t$.

Moreover, similar to eq.(4) we have $\sum_t p_h(u_t)\, h^d \approx 1$. It further follows from
$p_h(u_t) \approx \frac{1}{N} G(u_t|u_t, \sigma^2 I_d) = \frac{1}{N (2\pi\sigma^2)^{d/2}}$ that $h = \sqrt{2\pi}\,\sigma$. Putting it into eq.(7) with
$g(u|\theta)$ and $p(u) = p_h(u)$, we consider the case (d) and get that $\max_\theta H(p\|g)$
becomes equivalent to

$\max_{\theta, \sigma^2} L_S(\theta, \sigma^2)$, $L_S(\theta, \sigma^2) = \frac{1}{N} \sum_t \int G(u|u_t, \sigma^2 I_d) \ln g(u|\theta)\, du + d \ln \sigma$, (14)

which regularizes the ML learning by smoothing each likelihood term $\ln g(u_t|\theta)$
via $G(u|u_t, \sigma^2 I_d)$ in the near-neighbor of $u_t$. We call this new regularization
method data smoothing.

We further get $\int G(u|u_t, \sigma^2 I_d) \ln g(u|\theta)\, du \approx \ln g(u_t|\theta) + 0.5\sigma^2\, Tr[H_g(u_t|\theta)]$,
where $H_g(u_t|\theta)$ is the Hessian matrix of $\ln g(u|\theta)$ with respect to u. Putting it
into eq.(14) we can further get

$\sigma^2 = -N d / \sum_t Tr[H_g(u_t|\theta)]$. (15)

Thus, $L_S(\theta, \sigma^2)$ consists of the likelihood $L(\theta)$ plus the regularization term
$0.5\, d \ln \sigma^2 + 0.5\sigma^2 \frac{1}{N} \sum_t Tr[H_g(u_t|\theta)]$ under the constraint of eq.(15).
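
The second-order expansion of the smoothed likelihood term can be verified on a simple case. The sketch below assumes a one-dimensional Gaussian model $g(u|m, s)$ (an illustrative choice), for which $\ln g$ is quadratic in u and the expansion is exact. The Gaussian average is computed with a three-point Gauss-Hermite rule, which is exact for polynomials up to degree five; all function names are hypothetical.

```python
import math

def log_gauss(u, m, s):
    """ln g(u|m, s) for a 1-D Gaussian with mean m and standard deviation s."""
    return -0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * ((u - m) / s) ** 2

def smoothed_loglik(u_t, sigma, m, s):
    """E[ln g(u_t + eps)] for eps ~ N(0, sigma^2), by 3-point Gauss-Hermite quadrature."""
    a = math.sqrt(3.0) * sigma
    return (2.0 / 3.0) * log_gauss(u_t, m, s) + (
        log_gauss(u_t + a, m, s) + log_gauss(u_t - a, m, s)) / 6.0

def second_order(u_t, sigma, m, s):
    """ln g(u_t) + 0.5 * sigma^2 * Tr[Hessian], with Hessian = -1/s^2 for this model."""
    hessian = -1.0 / (s * s)
    return log_gauss(u_t, m, s) + 0.5 * sigma ** 2 * hessian
```

Because the Hessian is negative here, smoothing lowers each likelihood term by $0.5\sigma^2/s^2$, so sharply peaked models are penalized more than flat ones.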

We can simplify the implementation of eq.(14) by alternately repeating the
following two steps:

Step 1: fix $\theta$, get $\sigma^2$ by eq.(15). Step 2: fix $\sigma^2$, get $\theta^{new} = \theta^{old} + \eta \Delta\theta$, (16)

where $\eta > 0$ is a small learning stepsize and $\Delta\theta$ is an ascent direction of
$\int G(u|u_t, \sigma^2 I_d) \ln g(u|\theta)\, du$, which can be approximately obtained by random
sampling, i.e., from $\ln g(u'_t|\theta)$ with a new data set $u'_t = u_t + \varepsilon_t$
and $\varepsilon_t$ being a sample from $G(\varepsilon|0, \sigma^2 I_d)$.

Parameter learning on BYY learning system When $u = [x, y]$ consists
of $x = [x_1, \cdots, x_{d_x}]^T$ which is observable and $y = [y_1, \cdots, y_{d_y}]^T$ which is
not observable, we cannot get p(u) by either eq.(9) or eq.(13) directly. Such
cases are widely encountered in intelligent systems. The BYY system was proposed
for these purposes as a unified statistical learning framework, firstly in 1995 [16],
and systematically developed in past years [8,10,11]. In the BYY system, we
describe p(u), g(u) on $x \in X$, $y \in Y$ with the help of two complementary Bayesian
representations:

$p(u) = p_{M_{y|x}}(y|x)\, p_{M_x}(x)$, $g(u) = p_{M_{x|y}}(x|y)\, p_{M_y}(y)$. (17)

In this formulation, p(u) is called the Yang model, representing the observable or
so-called Yang space by $p_{M_x}$, and the pathway $x \to y$ by $p_{M_{y|x}}$ is called the Yang or
forward pathway, while g(u) is called the Ying model, which represents the invisible
state space or Ying space by $p_{M_y}$ and the Ying or backward pathway $y \to x$ by
$p_{M_{x|y}}$. Such a pair of Ying-Yang models is called a Bayesian Ying-Yang (BYY)
system. Such terms are used because the formalization eq.(17) is in accord
with the famous eastern ancient Ying-Yang philosophy, as discussed in [10].

Now we have four components $p_{M_x}$, $p_{M_{y|x}}$, $p_{M_{x|y}}$ and $p_{M_y}$. Given a set of
observable samples $\{x_t\}_{t=1}^{N}$, $p_{M_x}$ is still given by either eq.(9) or eq.(13) with
each appearance of $u$, $u_t$ replaced by $x$, $x_t$. $p_{M_y}$ is a parametric model designed
according to the representation form of y in different applications, most of which
have been summarized into a unified framework called Y-Tl model [8,9]. Each of
$p_{M_{y|x}}$, $p_{M_{x|y}}$ can be of two typical types. One is parametric with a set of unknown
parameters, i.e., $\theta_{y|x}$ for $p_{M_{y|x}}$ or $\theta_{x|y}$ for $p_{M_{x|y}}$. The other is called structure-free,
which means there are no structural constraints, such that $p_{M_a}$ for each $a \in \{x|y, y|x\}$
is free to take any element of $\mathcal{P}_a$, where $\mathcal{P}_{x|y}$ and $\mathcal{P}_{y|x}$ denote the family of all the
densities in the form p(x|y) and p(y|x), respectively. As a result, there are three
typical architectures, namely backward, forward, and bi-directional architectures
[10,8,9].

Putting eq.(17) in eq.(7) and eq.(8), with eq.(5) we can obtain the corresponding
details for implementing parameter learning on BYY systems. More
specifically, consider the case where $p_{M_x}$ is given by eq.(9) and $p_{M_y}$ is either a
point density for a discrete y or approximated with the help of eq.(5) via a set of
sampling points $\{y_t\}$. For the choices (a) & (b) of $z_g$ in eq.(5), we have that
$H(p\|g)$ takes the following specific form:

$H(p\|g) = L(\theta_{x|y}) + L(\theta_y)$,
$L(\theta_{x|y}) = \frac{1}{N} \sum_t \ln p_{M_{x|y}}(x_t|y_t)$, $L(\theta_y) = \frac{1}{N} \sum_t \ln p_{M_y}(y_t)$,
$y_t = \arg\max_y [p_{M_{x|y}}(x_t|y)\, p_{M_y}(y)]$, (a) when $p_{M_{y|x}}$ is structure-free,
$y_t = \arg\max_y p_{M_{y|x}}(y|x_t)$, (b) when $p_{M_{y|x}}$ is parametric. (18)



The case (a) is actually in the same form as the case (b) since the maximization
of $H(p\|g)$ will lead to

$p_{M_{y|x}}(y|x_t) = \frac{p_{M_{x|y}}(x_t|y)\, p_{M_y}(y)}{p_M(x_t)}$, $p_M(x_t) = \int p_{M_{x|y}}(x_t|y)\, p_{M_y}(y)\, \nu(dy)$, (19)

when $p_{M_{y|x}}$ is structure-free. Thus, $\arg\max_y p_{M_{y|x}}(y|x_t) = \arg\max_y [p_{M_{x|y}}(x_t|y)\, p_{M_y}(y)]$
and $y_t$ is actually obtained via the competition based on the posteriori distribution
$p_{M_{y|x}}$ and thus is called posteriori winner-take-all (WTA). Correspondingly,
the learning eq.(8) with the help of this competition is called posteriori WTA learning.
Particularly, for the case (a), the competition is made coordinately via the generative
model $p_{M_{x|y}}$ and the representation model $p_{M_y}$; we also call it coordinated
competition learning (CCL) in [13].
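
As an illustration of the posteriori WTA competition, the sketch below assumes a discrete y that indexes the components of a one-dimensional Gaussian mixture with a fixed common standard deviation s. The winner for each sample is the index maximizing $p_{M_{x|y}}(x_t|y)\, p_{M_y}(y)$, and only the winners are re-fitted. The batch re-fitting step stands in for the gradient ascent described in the text, and all names are hypothetical.

```python
import math

def log_joint(x, alpha, mean, s):
    """ln[ p(x|y=j) p(y=j) ] for a 1-D Gaussian component with standard deviation s."""
    return (math.log(alpha) - 0.5 * math.log(2.0 * math.pi * s * s)
            - 0.5 * ((x - mean) / s) ** 2)

def wta_step(xs, alphas, means, s):
    """One posteriori-WTA pass: hard-assign each x_t to its winner, then re-fit the winners."""
    live = [j for j, a in enumerate(alphas) if a > 0.0]   # components still in use
    winners = [max(live, key=lambda j: log_joint(x, alphas[j], means[j], s)) for x in xs]
    new_alphas, new_means = list(alphas), list(means)
    for j in range(len(alphas)):
        members = [x for x, w in zip(xs, winners) if w == j]
        new_alphas[j] = len(members) / len(xs)            # maximizes sum_t ln p(y_t)
        if members:
            new_means[j] = sum(members) / len(members)    # maximizes sum_t ln p(x_t|y_t)
    return new_alphas, new_means, winners

xs = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]      # hypothetical observations
alphas, means = [0.5, 0.5], [-1.0, 1.0]     # hypothetical initial values
for _ in range(5):
    alphas, means, winners = wta_step(xs, alphas, means, 1.0)
```

A component that never wins ends with a zero mixing weight and drops out of the competition, which hints at how the least-complexity force can discard redundant structure.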

In implementation of eq.(8) with eq.(18) at fixed k, the parameters $\theta_{x|y}, \theta_y$
can be updated via gradient ascent on $L(\theta_{x|y})$, $L(\theta_y)$ respectively. There is no
need to update $\theta_{y|x}$ for the case (a), while for the case (b), we consider those
models of $p_{M_{y|x}}$ such that



yt 



■ max Pm , 



,{y\xt)^ / yPM„.{y\xt)n{dy) ^ f{xt\6y^x 
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thus Vo^,^H{p\\g) = ^(y) = (20) 

with which we can also make gradient ascent updates on $\theta_{y|x}$.

Furthermore, when we consider the choice (c) of $z_g$ in eq.(5) and the choice
(d) of $z_g$ in eq.(5) with $p_{M_x}$ given by eq.(13), we can also get the two new
regularized learning methods, namely normalized point density and data smoothing,
in various specific forms for the different specific designs of $p_{M_{y|x}}$ and $p_{M_{x|y}}$.
Readers are referred to [8] for details.

4 Model Selection and Automated Model Selection

The task of selecting the best specific value of k is usually called model selection
since different values correspond to different scales of model architectures. It follows
from eq.(8) that searching for a best k must be made together with parameter
learning for searching for a best $\theta$. Actually, the task eq.(8) is a typical optimization
problem that mixes $\theta$ of continuous parameters with k of discrete variables.

We have the following two procedures for its implementation:

• Parameter learning followed by model selection We enumerate k
from a small scale incrementally, which results in a number of specific settings
of k by gradually increasing the scales. At each specific k we perform parameter
learning as in Sec. 3 to get the best $\theta^*_k$, and make a selection on a best k* by

$\max_k J(k)$, $J(k) = H(\theta^*_k, k)$. (21)

That is, the entire process consists of two steps: the step of determining $\theta^*$
followed by the step of selecting k*. Moreover, as further justified in Sec. 5, the
step of determining $\theta^*$ can be replaced by

$\min_\theta KL(\theta)$, $KL(\theta) = KL(p\|g) = \int p(u) \ln \frac{p(u)}{g(u)}\, \nu(du)$,
$KL(p\|g) = H(p\|p) - H(p\|g)$, (22)

where $H(p\|p)$ is a special case of eq.(7) with g(u) replaced by p(u). We have
$KL(p\|g) = 0$ when p(u) = g(u) and $KL(p\|g) > 0$ when $p(u) \neq g(u)$. Thus, the
minimization pushes p(u), g(u) to best match, i.e., it realizes the first purpose of
harmony learning. However, it takes no account of the second purpose of harmony
learning, i.e., forcing the least complexity. There may be some discrepancy
between the $\theta^*$ obtained in the two ways. However, this discrepancy will
be further reduced by the same subsequent step of model selection eq.(21).

• Parameter learning with automated model selection This is
another advantage that the harmony learning eq.(8) has but the KL-learning
eq.(22) has not. In the cases where setting certain elements in $\theta$ to certain specific
values (e.g., 0) becomes equivalent to reducing k from a higher scale to a lower
scale, we can prefix all the integers in k to be large enough and then simply
implement parameter learning $\max_\theta H(\theta, k)$ as in Sec. 3, in which there are
forces for both best fitting and the least complexity, which will thus effectively
reduce the scales of k.
For the learning on BYY system given by the case eq.(18), the posteriori
WTA may cause many local maxima, which will affect the performance. A
solution to this problem is

$\min_\theta [\lambda H(p\|p) - H(p\|g)]$, (23)

with $\lambda$ starting at $\lambda = 1$, which is equivalent to eq.(22), and then gradually
reducing to zero, which is equivalent to $\max_\theta H(\theta, k)$. Such an annealing process
can be regarded as a regularization on the harmony learning $\max_\theta H(\theta, k)$.

This annealing process becomes unnecessary when $\max_\theta H(\theta, k)$ is implemented
with regularization via either normalized point density or data smoothing.
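
The role of the annealing weight can be seen in a small discrete example. As an illustration only (this derivation is not taken from the paper): for a fixed point density g and a free p, minimizing $\lambda \sum_t p_t \ln p_t - \sum_t p_t \ln g_t$ under $\sum_t p_t = 1$ gives $p_t \propto g_t^{1/\lambda}$. At $\lambda = 1$ this is the KL solution p = g, and as $\lambda$ goes to zero p approaches the winner-take-all solution.

```python
import math

def annealed_p(g, lam):
    """Minimizer over p of lam * sum p ln p - sum p ln g: p_t proportional to g_t^(1/lam)."""
    logs = [math.log(gt) / lam for gt in g]
    top = max(logs)                              # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def objective(p, g, lam):
    """lam * H(p||p) - H(p||g) for point densities."""
    neg_entropy = sum(pt * math.log(pt) for pt in p if pt > 0.0)
    cross = sum(pt * math.log(gt) for pt, gt in zip(p, g) if pt > 0.0)
    return lam * neg_entropy - cross
```

Lowering $\lambda$ gradually therefore moves the solution smoothly from pure matching towards the hard competition, instead of starting from the hard competition directly.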

5 Harmony Measure, KL-Divergence and Geometry 

We can get further insights on the harmony learning and the KL learning eq.(22)
from understanding the geometrical properties of the harmony measure and the
KL-Divergence.

We start by reviewing some basic properties in the conventional vector space
$R^d$. We denote $U_c = \{u : u \in R^d$ and $\|u\|^2 = c$, for a constant $c > 0\}$. For
$u \in U_c$, $v \in U_{c'}$, the projection of v on u is

$u^T v = \sqrt{c c'} \cos\phi$, where $\phi$ is the angle between the directions of u and v, (24)

with the following properties:

• The self-projection of u to u is simply the norm $\|u\|^2$.

• The projection $u^T v$ is maximized when $\phi = 0$, i.e., v is co-directional with u.

• When c = c', $\phi = 0$ implies u = v, i.e., the projection is maximized if and
only if u = v.

• $\phi = 0$ may be achieved by rotating the directions of both u and v or the direction
of either u or v.

Using u to represent v, the error or residual u - v has a projection on u:

$(u - v)^T u = \|u\|^2 - v^T u = c - v^T u$, (25)

with the following properties:

• Minimizing this residual projection is equivalent to maximizing the projection
$u^T v$. When $v^T u \geq 0$, i.e., $0 \leq \phi \leq 0.5\pi$, this residual projection
$(u - v)^T u$ is actually the difference between the norm of u and the projection
$u^T v$.

• The residual u - v is said to be orthogonal to u when the residual projection
$(u - v)^T u$ becomes zero, where the norm of u and the projection $u^T v$ become
the same, i.e., $u^T v = \|u\|^2 = c$ or $\sqrt{c/c'} = \cos\phi$.

• When c = c', the minimum value of $(u - v)^T u$ is 0, which is reached if and
only if u = v.
Therefore, we see that maximizing the projection $u^T v$ and minimizing the
residual projection $(u - v)^T u$ are two complementary but equivalent concepts. In
general, the minimization of the residual projection $(u - v)^T u$ usually does not
result in the orthogonality of u - v to u, since the minimum value of $(u - v)^T u$
can be negative. However, when c = c', this minimum is 0, and thus in this case
the concepts of maximizing the projection $u^T v$ for the co-directionality of v to
u, of minimizing the residual projection $(u - v)^T u$, and of making the residual
u - v orthogonal to u are all equivalent.
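
A small numerical check of these vector-space statements may help. The sketch below works in two dimensions with a fixed $u = (1, 0)$, so that c = 1; the helper names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scaled_unit(c, phi):
    """2-D vector with squared norm c at angle phi from the first axis."""
    return [math.sqrt(c) * math.cos(phi), math.sqrt(c) * math.sin(phi)]

def residual_projection(u, v):
    """(u - v)^T u, which equals ||u||^2 - v^T u."""
    return dot([a - b for a, b in zip(u, v)], u)

u = [1.0, 0.0]                                        # c = 1
same_norm = [residual_projection(u, scaled_unit(1.0, k * math.pi / 8)) for k in range(9)]
longer_v = residual_projection(u, scaled_unit(4.0, 0.0))   # c' = 4, co-directional with u
```

With c = c' every residual projection in `same_norm` is non-negative and the value at $\phi = 0$ is zero, whereas with c' > c the co-directional v gives a negative residual projection, so minimizing it no longer corresponds to orthogonality.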

In an analogy, we consider a functional space

$\mathcal{G} = \{g(u) : g(u) \geq 0$ and $\int g(u)\, \mu(du) < \infty\}$, (26)

where $u \in S_u \subseteq R^d$ and $\mu$ is a given measure on the support $S_u$, and $\mu(du)$
relates to du only but to neither u nor g(u). A useful subspace $\mathcal{P}_c \subset \mathcal{G}$ is

$\mathcal{P}_c = \{p(u) : p(u) \geq 0$ and $\int p(u)\, \mu(du) = c$, for a constant $c > 0\}$. (27)

Particularly, when c = 1, $\mathcal{P}_1$ is the probability density space.

Given $p(u) \in \mathcal{P}_c$, $g(u) \in \mathcal{P}_{c'}$, we define the projection of g(u) on p(u) by

$H(p\|g) = \int p(u)\mu(du) \ln(g(u)\mu(du)) = \int p(u) \ln g(u)\, \mu(du) + \ln \mu(du)$, (28)

where $\mu(du)$ takes the same role as $z_g^{-1}$ in eq.(7) with $z_g$ given in eq.(5). In
correspondence to eq.(24), we have the following properties:

• The self-projection of p(u) to p(u) is $H(p\|p) = \int p(u)\mu(du) \ln[p(u)\mu(du)]$,
which can be regarded as a type of norm of p, and it becomes the negative entropy
of the probability distribution $p(u)\mu(du)$ when $p(u) \in \mathcal{P}_1$ is a density.

• $H(p\|g)$ is maximized if and only if $g(u) = \frac{c'}{c} p(u)$, i.e., g(u) has the same shape
as p(u), because we always have $\int \hat{p}(u) \ln \hat{g}(u)\, \mu(du) \leq \int \hat{p}(u) \ln \hat{p}(u)\, \mu(du)$
with $c\,\hat{p}(u) = p(u)$, $c'\hat{g}(u) = g(u)$ and $\hat{p}(u), \hat{g}(u) \in \mathcal{P}_1$.

• When c = c', $H(p\|g)$ is maximized if and only if g(u) = p(u).

• When p(u) is free to be any choice in $\mathcal{P}_c$, the maximization of $H(p\|g)$ will
also let p(u) become $c\,\delta(u - u_p)$, where $u_p = \arg\max_u g(u)$.



The last property is quite different from the situation of eq.(24), where when
both the directions of u and v are free to change, the maximization of the
projection $u^T v$ only ensures that u and v are in the same direction but does not
impose any specific preference for the direction of u. The maximization of $H(p\|g)$
ensures not only that p(u) and g(u) have the same shape in the sense $g(u) = \frac{c'}{c} p(u)$
but also that p(u) prefers to have the simplest shape $c\,\delta(u - u_p)$. Therefore, when
p(u) is free to be any choice in $\mathcal{P}_c$ and g(u) is free to be any choice in $\mathcal{P}_{c'}$, the
maximization of $H(p\|g)$ will finally let both p(u) and g(u) become impulse
functions with different scales but located at the same point $u_p$, which can be any
point in $R^d$. When $p(u) \in P$, $g(u) \in Q$ are constrained so that they cannot become
impulse functions, the maximization of $H(p\|g)$ will make p(u) and g(u)
not only probabilistically simple but also close in shape.

If we use $p(u) \in \mathcal{P}_c$ to represent $g(u) \in \mathcal{P}_{c'}$ and define the discrepancy or
residual¹ by $p(u) \ominus g(u) = p(u)\mu(du)/[g(u)\mu(du)] = p(u)/g(u)$, with $g(u)\mu(du)$
in eq.(28) replaced by the residual in this representation, we can find that the
residual projection on p(u) is

$R(p\|g) = \int p(u) \ln[p(u)/g(u)]\, \mu(du) = H(p\|p) - H(p\|g)$. (29)

Since $p(u) = c\,\hat{p}(u)$, $g(u) = c'\hat{g}(u)$ with $\hat{p}(u), \hat{g}(u) \in \mathcal{P}_1$, it follows that

$R(p\|g) = c\,[KL(\hat{p}\|\hat{g}) + \ln \frac{c}{c'}]$, (30)

from which we can observe the following properties:

• Minimizing $R(p\|g)$ is equivalent to minimizing the self-projection of p(u) and
maximizing the projection of g(u) on p(u). When the self-projection $H(p\|p)$
is fixed at a constant, minimizing the residual projection is equivalent to
maximizing $H(p\|g)$.

• Since $H(p\|g) \geq 0$, $R(p\|g)$ is actually the difference between the norm of p and
the projection of g on p.

• The residual $p(u) \ominus g(u)$ is said to be orthogonal to p(u) when the residual
projection $R(p\|g)$ becomes 0, which happens when the norm of p and the projection
of g on p become the same, i.e., $H(p\|p) = H(p\|g)$.

• When c = c', the minimum value of $R(p\|g)$ is 0, which is reached if and only
if p(u) = g(u). Moreover, when c = c' = 1, p(u) and g(u) are densities and
$R(p\|g) = KL(p\|g)$.

The concepts of maximizing $H(p\|g)$ for co-directionality and minimizing the
residual projection $R(p\|g)$ for orthogonality are complementary and closely
related, but not equivalent even when c = c' = 1, which is not exactly the same
situation as that for eq.(24) and eq.(25). Specifically, minimizing $R(p\|g)$ only
makes p(u) and g(u) become as close as possible in shape but, unlike
maximizing $H(p\|g)$, has no force to compress p(u) to become as close as
possible to an impulse function. However, it follows from eq.(29) that the two
become equivalent under the constraint that $H(p\|p)$ is fixed at a constant $H_0$,
that is, we have

$\max_{p \in P,\, g \in Q,\ \mathrm{s.t.}\ H(p\|p) = H_0} H(p\|g)$ is equivalent to $\min_{p \in P,\, g \in Q,\ \mathrm{s.t.}\ H(p\|p) = H_0} R(p\|g)$. (31)

¹ Under this definition, $p(u) \ominus g(u)$ is generally not guaranteed to still remain in Q. For a
subset Qg C Q with Qg = {g{n) : g{u) G G, g^{u)p{du) < oo, g~^{u)g{du) < 
oo, g(du) < oo}, we can define the addition by r(n) = p(tt) ® g(ii) = p(ii)g(ii) 
and have r(n) G Qg- Also, we have the unit 1 = p(n)p~^(n) G Qg for G Su and 
the inverse p~^{u) = l/p{u) G Qg- In this case, it follows that the induced minus 
operation p{u) 0 g{u) = p(n)/^(n) is still in Qg. That is, we get Qg as an Abel 
group. Moreover, on an appropriate subset Qi we can further define the dot product 
a 0 p{u) = p{u)^ G Qi for a ^ R and thus get Qi as a linear functional space. 
Furthermore, we can introduce the geometrical concepts of the projection eq.(28), 
the residual projection eq.(29) and the corresponding orthogonality to Qg,Qi- 
Moreover, the minimum value of $R(p\|g)$ again can be negative and thus the
minimization of $R(p\|g)$ generally does not result in the orthogonality of the
residual $p(u) \ominus g(u)$ to p(u). However, this minimum is 0 when c = c', in which
case the concepts of minimizing the residual projection $R(p\|g)$ and making
$p(u) \ominus g(u)$ orthogonal to p(u) are equivalent.
Abstract. Observational learning algorithm is an ensemble algorithm
where each network is initially trained with a bootstrapped data set 
and virtual data are generated from the ensemble for training. Here we 
propose a modular OLA approach where the original training set is par- 
titioned into clusters and then each network is instead trained with one 
of the clusters. Networks are combined with different weighting factors 
now that are inversely proportional to the distance from the input vec- 
tor to the cluster centers. Comparison with bagging and boosting shows 
that the proposed approach reduces generalization error with a smaller 
number of networks employed. 



1 Introduction 

Observational Learning Algorithm (OLA) is an ensemble learning algorithm that
generates “virtual data” from the original training set and uses them for training
the networks [1] [2] (see Fig. 1). The virtual data were found to help avoid
overfitting, and to drive consensus among the networks.



[Initialize] Bootstrap D into L replicates $D_1, \ldots, D_L$.
[Train]
Do for t = 1, ..., G
  [T-step] Train each network:
    Train network $f_j^t$ with $D_j^t$ for each $j \in \{1, \ldots, L\}$.
  [O-step] Generate virtual data set $V_j^t$ for network j:
    $V_j^t = \{(x', y') \mid x' = x + \epsilon,\ \epsilon \sim N(0, \Sigma),\ x \in D_j,$
      $y' = \sum_{j=1}^{L} \beta_j f_j^t(x')$ where $\beta_j = 1/L\}$.
    Merge virtual data with original data:
    $D_j^{t+1} = D_j \cup V_j^t$.
End
[Final output] Combine networks with weighting factors $\beta$'s:
  $f_{com}(x) = \sum_{j=1}^{L} \beta_j f_j(x)$ where $\beta_j = 1/L$.



Fig. 1. Observational Learning Algorithm (OLA)



Empirical study showed that



K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 126-132, 2000. 
© Springer- Verlag Berlin Heidelberg 2000 




Observational Learning with Modular Networks 127 



the OLA performed better than bagging and boosting [3]. An ensemble achieves the
best performance when the member networks’ errors are completely uncorrelated
[5]. Networks become different when they are trained with different training data
sets. In the OLA shown in Fig. 1, bootstrapped data sets are used. Although they are
all different, they are probabilistically identical since they come from the identical
original training data set. In order to make them more different, we propose
to “specialize” each network by clustering the original training data set and using
each cluster to train a network. Clustering assigns each network a cluster
center. These centers are used to compute weighting factors when combining
network outputs for virtual data generation as well as for recall.

The next section presents the proposed approach in more detail. In Sections 3
and 4, experimental results with artificial and real-world data sets are described.
The performance of the proposed approach is compared with that of bagging and
boosting. Finally we conclude the paper with a summary of results and future
research plans.

2 Modular OLA 

The key idea of our approach lies in network specialization and its exploitation 
in network combining. This is accomplished in two steps. First is to partition the 
whole training set into clusters and to allocate each data cluster to a network. 
Second is to use the cluster center locations to compute the weighting factors in 
combining ensemble networks. 

2.1 Data set partitioning with clustering 

The original data set D is partitioned into K clusters using K-means clustering or
Self Organizing Feature Map (SOFM). Then, a total of K networks are employed
for the ensemble. Each cluster is used to train one network (see the [Initialize]
section of Fig. 2). Partitioning the training data set helps to reflect the intrinsic
distribution of the data set in the ensemble. In addition, exclusive allocation of
clustered data sets to networks corresponds to a divide-and-conquer strategy in a
sense, thus making the learning task less difficult. Partitioning also solves the
problem of choosing the right number of networks for the ensemble: the same number
of networks is used as the number of clusters. The problem of determining a proper
ensemble size can thus be efficiently avoided.
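
The partitioning step can be sketched with a plain K-means loop. This is a minimal illustration only: the deterministic initialization from the first K points, the fixed iteration count and the function name are assumptions, and the paper does not say which K-means variant or initialization was used.

```python
def kmeans(points, k, iters=20):
    """Partition points into k mutually exclusive clusters; returns (clusters, centers)."""
    centers = [list(p) for p in points[:k]]      # naive initialization (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center (squared Euclidean distance)
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:                          # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return clusters, centers
```

Each returned cluster would then serve as the training set of one network, and each center as the reference point for that network's weighting factor.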

2.2 Network combining based on cluster distance 

How to combine network outputs is another important issue in ensemble learning. 
Specialization proposed here helps to provide a natural way to do it. The idea 
is to measure how confident or familiar each network is for a particular input. 
Then, the measured confidence is used as a weighting factor for each network 
in combining networks. The confidence of each network or cluster is considered 
inversely proportional to the distance from input vector x' to each cluster center. 
[Initialize]
  1. Cluster D into K clusters, with K-means algorithm or SOFM,
     $D_1, D_2, \ldots, D_K$ with centers located at $C_1, C_2, \ldots, C_K$, respectively.
  2. Set the ensemble size L equal to the number of clusters K.
[Train]
Do for t = 1, ..., G
  [T-step] Train each network:
    Train network $f_j^t$ with $D_j^t$ for each $j \in \{1, \ldots, L\}$.
  [O-step] Generate virtual data set $V_j^t$ for each network j:
    $V_j^t = \{(x', y') \mid x' = x + \epsilon,\ \epsilon \sim N(0, \Sigma),\ x \in D_j,$
      $y' = \sum_{j=1}^{L} \beta_j f_j^t(x')$ where $\beta_j = 1/d_j(x')$
      and $d_j(x') = \sqrt{(x' - C_j)^T \Sigma_j^{-1} (x' - C_j)}\,\}$.
    Merge virtual data with original data:
    $D_j^{t+1} = D_j \cup V_j^t$.
End
[Final output] Combine networks with weighting factors $\beta$'s:
  $f_{com}(x) = \sum_{j=1}^{L} \beta_j f_j(x)$ where $\beta_j = 1/d_j(x')$.



Fig. 2. Modular Observational Learning Algorithm (MOLA)



For estimation of the probability density function (PDF) of the training data set,
we use a mixture of gaussian kernels since we have no prior statistical information
[6] [7] [8]. The familiarity of the j-th kernel function to input x' is thus defined
as

$\Theta_j(x') = \frac{1}{|\Sigma_j|^{1/2}} \exp\{-\frac{1}{2}(x' - C_j)^T \Sigma_j^{-1} (x' - C_j)\}$, $(j = 1, \ldots, L)$. (1)

Taking natural logarithms of Eq. 1 leads to

$\log \Theta_j(x') = -\frac{1}{2}\log|\Sigma_j| - \frac{1}{2}(x' - C_j)^T \Sigma_j^{-1} (x' - C_j)$, $(j = 1, \ldots, L)$. (2)

Assuming $\Sigma_j = \Sigma_{j'}$ for $j \neq j'$ gives a reformulated measure of the degree of
familiarity, $d_j(x')$:

$\log \Theta_j(x') \propto -d_j(x')$ (3)

where $d_j(x') = \sqrt{(x' - C_j)^T \Sigma_j^{-1} (x' - C_j)}$.

So, each network’s familiarity turns out to be proportional to the negative
Mahalanobis distance between an input vector and the center of the corresponding
cluster. The network whose cluster center is close to x' is given more weight in
combining outputs. The weighting factor $\beta_j$ is defined as the reciprocal of the
distance $d_j(x')$, i.e., $\beta_j = 1/d_j(x')$, both in [O-step] and [Final output] as shown
in Fig. 2. Compare it with Fig. 1, where simple averaging was used with $\beta_j$ of
1/L.
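
The distance-based combination can be sketched as follows. Several choices here are assumptions rather than details confirmed by the text: the weights are normalized to sum to one, a small constant guards against division by zero when the input coincides with a center, a single shared inverse covariance is used (following the equal-covariance assumption above), and the networks are stand-in callables.

```python
import math

def mahalanobis(x, center, inv_cov):
    """sqrt((x - C)^T Sigma^{-1} (x - C)) for a given inverse covariance matrix."""
    diff = [a - b for a, b in zip(x, center)]
    n = len(diff)
    quad = sum(diff[i] * inv_cov[i][j] * diff[j] for i in range(n) for j in range(n))
    return math.sqrt(quad)

def combine(x, networks, centers, inv_cov, eps=1e-12):
    """Weight each network output by the reciprocal distance from x to its cluster center."""
    raw = [1.0 / (mahalanobis(x, c, inv_cov) + eps) for c in centers]
    total = sum(raw)
    betas = [r / total for r in raw]             # normalization is an assumption
    return sum(b * f(x) for b, f in zip(betas, networks)), betas
```

For an input close to one cluster center, that network dominates the combined output, while inputs midway between centers receive close to a plain average.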
3 Experimental result I: artificial data

The proposed approach was first applied to an artificial function approximation
problem defined by $y = \sin(2x_1 + 3x_2^2) + \epsilon$, where $\epsilon$ is from a gaussian
distribution $N(0, 0.05^2)$. Each input vector x was generated from one of four
gaussian distributions $N(C_j, 0.3^2 I)$, where $C_j \in \{(x_1, x_2) \mid (-0.5, -0.5), (0.5, -0.5),
(-0.5, 0.5), (0.5, 0.5)\}$. A total of 320 data points were generated, with 80 from each
cluster. Some of the data points are shown in Fig. 3. The number of networks was also
set to 4. Note that clustering was not actually performed since the clustered data
sets were used. Four 2-5-1 MLPs were trained with the Levenberg-Marquardt
algorithm for five epochs.

T-step and O-step were iterated for 4 generations. At each generation, 80
virtual data were generated and then merged with the original 80 training data 
for training in the next generation. Note that the merged data set size does 
not increase at each generation by 80, but instead stays at 160 since the virtual 
data are replaced by new virtual data at each generation. For comparison, OLA, 
simple-averaging bagging and adaboost.R2 [4] were employed. For bagging, three 
different ensemble sizes were tried, 4, 15 and 25. For boosting, the ensemble size 
differs in every run. We report an average size which was 36. Experiments were 
run 50 times with different original training data sets. Two different test data sets 
were employed, a small set for a display purpose and a large set for an accurate 
evaluation purpose. The small test set consists of 25 data points with their ID 
numbers, shown in Fig. 3 (LEFT). The mesh shown in Fig. 3 (RIGHT) shows 
the surface of the underlying function to be approximated while the square dots 
represent the output values of the proposed approach MOLA (modular OLA) 
for test inputs. MOLA’s accuracy is shown again in Fig. 4 where 25 test data 
points are arranged by their ID numbers. Note the accuracy of MOLA compared 
with other methods, particularly at cluster centers, i.e. 7, 9, 17 and 19. Bagging 
used 25 networks here. For those inputs corresponding to the cluster centers, the 
familiarity of the corresponding network is highest. 

The large test set consists of 400 data points. Table 1 summarizes the result 
with average and standard deviation of mean squared error (MSE) of 50 runs. 
In terms of average MSE, OLA with 25 networks was best. If we consider MSE 
and ensemble size together, however, MOLA is a method of choice with a rea- 
sonable accuracy and a small ensemble. Since every network in an ensemble is 
trained, the ensemble size is strongly related with the training time. Bagging 
achieved the same average MSE with MOLA by employing more than 6 times 
more networks. MOLA did better than OLA-4, thus OLA seems to need more 
networks than MOLA to achieve a same level of accuracy. Of course, there is an 
overhead associated with MOLA, i.e. clustering at initialization. A fair compar- 
ison of training time is not straightforward due to difference in implementation 
efficiency. Boosting performed most poorly in all aspects. The last row displays 
p-value of pair-wise t-tests comparing average MSEs among methods. With a 
null hypothesis of “no difference in accuracy” and a one-sided alternative hy- 
pothesis “MOLA is more accurate than the other method,” a smaller p-value 
leads to acceptance of the alternative hypothesis. Statistically speaking, MOLA
is more accurate than OLA-4, bagging-4 and boosting, but not others.



[Figure panel titles: Input Space; MOLA: Output Points; Test Data Points]

Fig. 3. [LEFT] Artificial data set generated from 4 gaussian distributions with a white 
noise. Each set is distinguished by graphic symbols. [RIGHT] Simulation results on 25 
test data points: the mesh is the surface of the underlying function to be approximated 
while the square-dots represent the test output values from MOLA. 




Fig. 4. 25 test data points are arranged by their ID numbers along x-axis. Note the 
MOLA’s accuracy near the cluster centers (7,9,17,19) compared with that of bagging 
and boosting. 



4 Experimental result II: real-world data 

The proposed approach was applied to real-world regression problems: Boston
Housing [9] and Ozone [10]. The two data sets were partitioned into 10 and 9
clusters, respectively, with the K-means algorithm. Then, ten 13-10-1 MLPs and
nine 8-10-1 MLPs were trained with the L-M algorithm, respectively. The test results are
Table 1. Experimental Results (Artificial Data)

50 runs            MOLA   OLA    OLA    OLA    Bagging  Bagging  Bagging  Boosting
Ensemble Size      4      4      15     25     4        15       25       Avg(36)
Avg MSE(10“^)      5.4    6.5    4.8    4.7    7.3      5.4      5.4      10.5
Std MSE(10“^)      4.0    2.0    0.9    0.8    3.0      2.7      1.3      9.2
P-value(T-test)    -      0.04   0.87   0.90   0.01     0.99     1.00     0.00



summarized in Table 2. For Boston housing problem, MOLA outperformed both 
bagging and boosting (0 p- values). For Ozone problem, MOLA outperformed 
boosting but not bagging. 



Table 2. Experimental Results (Real-World Data)

30 runs            Boston Housing                 Ozone
Tr/Val/Test        200/106/100                    200/30/100
                   MOLA    Bagging   Boosting     MOLA    Bagging   Boosting
Ensemble Size      10      25        Avg(48)      9       25        Avg(49)
Avg MSE (×10^?)    9.3     10.3      10.9                           21.0
Std MSE (×10^?)    0.89    0.96      1.39                           0.82
P-value (t-test)   -       0.00      0.00                           0.00



5 Conclusions 

In this paper, we proposed a modular OLA where each network is trained with
a mutually exclusive subset of the original training data set. Partitioning is
performed using the K-means clustering algorithm. Then one network per cluster is
trained with the corresponding data cluster. The networks are then combined with
weighting factors that are inversely proportional to the distance between the new
input vector and the corresponding cluster centers.
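
The combination step described above can be sketched as follows. This is only an illustration under stated assumptions: the experts are arbitrary callables standing in for the trained networks, the weights are normalized to sum to one, and the small constant `eps` (not part of the paper) guards against division by zero when the input coincides with a cluster center.

```python
import math

def inverse_distance_weights(x, centers, eps=1e-12):
    """Weights inversely proportional to the distance from x to each center."""
    dists = [math.dist(x, c) for c in centers]
    raw = [1.0 / (d + eps) for d in dists]   # eps is an assumption of this sketch
    total = sum(raw)
    return [r / total for r in raw]          # normalized to sum to 1

def combine(x, centers, experts):
    """Weighted average of the per-cluster predictors evaluated at x."""
    weights = inverse_distance_weights(x, centers)
    return sum(w * f(x) for w, f in zip(weights, experts))

# Hypothetical example: two cluster centers with constant "networks".
centers = [(0.0, 0.0), (10.0, 0.0)]
experts = [lambda x: 1.0, lambda x: 5.0]
midpoint_output = combine((5.0, 0.0), centers, experts)  # equal weights
```

An input close to one cluster center is dominated by that cluster's network, while an input equidistant from all centers receives the plain average.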

The proposed approach was compared with OLA, bagging, and boosting on
artificial function approximation problems and real-world problems. MOLA,
employing a smaller number of networks, performed better than OLA and bagging
on the artificial data. MOLA did better on one real data set and similarly on
the other. This preliminary result shows that the approach is a good
candidate for problems where the data sets are clustered well.

The current study has several limitations. First, a more extensive collection of
data sets has to be tried. Second, in clustering, the correct number of clusters
is hard to find. The experiments done so far produced a relatively small number of
clusters: 4 for the artificial data and 10 and 9 for the real-world data. It is
worthwhile to investigate the test performance with a larger MOLA ensemble. Third,
the weighting factors are naively based on the distance between the input vector
and the cluster centers. An alternative would be to use weighting factors inversely
proportional to the training error on the training data close to the input vector.

Abstract. Given a data set consisting of n-dimensional binary vectors
of positive and negative examples, a subset S of the attributes is called
a support set if the positive and negative examples can be distinguished
by using only the attributes in S. In this paper we consider several selection
criteria for evaluating the “separation power” of support sets, and
formulate combinatorial optimization problems for finding the “best and
smallest” support sets with respect to such criteria. We provide efficient
heuristics, some with a guaranteed performance rate, for the solution of
these problems, analyze the distribution of small support sets in random
examples, and present the results of some computational experiments
with the proposed algorithms.



1 Introduction 

We consider the problem of analyzing a data set consisting of positive and
negative examples for an unknown phenomenon. We denote by T the set of positive
examples, and by F the set of negative examples, and assume further that each
example is represented as a binary n-dimensional vector^1. This is a typical
problem setting studied in various fields, such as knowledge discovery, data mining,
learning theory and logical analysis of data (see e.g. [2, 3, 8, 15, 16, 22, 24, 25]).

^1 Even if the input data set contains non-binary attributes, it can be binarized,
e.g. by considering features such as “Age < 40”, “Color = Blue”, etc.; see e.g. [7, 21].

Let $\mathbb{B} = \{0, 1\}$, and let us call a pair $(T, F)$, for
$T, F \subseteq \mathbb{B}^n$, a partially defined Boolean function (or pdBf, in
short). Let us call a Boolean function $f : \mathbb{B}^n \to \mathbb{B}$ an
extension of $(T, F)$ if $f(a) = 1$ for all $a \in T$ and $f(b) = 0$ for all
$b \in F$. Such an extension (which exists iff $T \cap F = \emptyset$) can be
interpreted as a logical separator between the sets T and F, or equivalently, as an
explanation of the phenomenon represented by the data set (T, F) (see e.g.
[10, 11, 15]). It is quite common that the data set contains attributes which are
irrelevant to the phenomenon under consideration, and also ones which are dependent
on some other attributes of the data set. It is an interesting problem to recognize
these, and to find a (smallest) subset of the essential attributes which can still
explain the given data set (T, F). Finding a small subset of the variables which
explains the input data “as much as possible” is in the spirit of Occam’s razor [6],
and is a frequent subject of studies in computational learning theory (see e.g.
“attribute efficient learning” [4, 13, 20]). In some practical applications, such a
subset is sought to reduce the cost of data collection.

This problem has been an active research area for many years within statistics
and pattern recognition, though most of the papers there dealt with linear
regression [5] and used assumptions not valid for most learning algorithms [18].
Many related studies in machine learning consider this problem in conjunction
with some specific learning algorithms [1, 14, 19, 17]. In this study we accepted
the “small feature set” bias as in [1] and found further supporting evidence by
analyzing the distribution of small support sets in random data sets. We also
consider feature selection as a stand-alone task, independent of the applied
learning methods. For this we develop a family of measures, and formulate exact
optimization problems for finding the “best” feature subset according to these
measures. We propose polynomial-time solvable continuous relaxations of these
problems providing a “relevance” weight for each attribute (cf. [12, 19]), standard
greedy heuristics, as well as reverse greedy type heuristics with a worst-case
performance guarantee. We have tested the effects of the presence of both
dependent and irrelevant features on randomly generated data sets.^2



^2 Due to space limitations, we omit proofs, some of the separation measures, and
several of the computational results. In the complete version [9] we have also
included encouraging results with some real data sets.

2 Support sets and measures of separation

Let $V = \{1, 2, \ldots, n\}$ denote the set of indices of the attributes in the
input data set. For a subset $S \subseteq V$ and a vector $a \in \mathbb{B}^n$, let
$a[S]$ denote the projection of $a$ on $S$, and for a set $X \subseteq \mathbb{B}^n$
let us denote by $X[S] = \{a[S] \mid a \in X\}$ the corresponding family of
projections. Let us call a subset $S \subseteq V$ a support set of a pdBf $(T, F)$
if $T[S] \cap F[S] = \emptyset$ (cf. [15]).

First, we estimate the number of support sets of a given size K in randomly
generated pdBf-s. Let us suppose that the vectors of T and F are chosen uniformly
and independently from $\mathbb{B}^n$, drawing $m_T$ and $m_F$ vectors,
respectively. (Note that $|T| < m_T$ or $|F| < m_F$ may occur due to duplications.)
Let $n(K)$ denote the number of support sets of size K in this random pdBf (T, F).
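
As an illustration of the definitions above, the following minimal sketch tests whether a set of attribute indices is a support set and counts the support sets of a given size by brute force. The function names and the 0-based indexing are assumptions of this sketch, and the enumeration is only practical for small n.

```python
from itertools import combinations

def project(vectors, S):
    """Family of projections X[S] of the vectors onto the index set S."""
    return {tuple(v[j] for j in S) for v in vectors}

def is_support_set(T, F, S):
    """S is a support set iff T[S] and F[S] have no common element."""
    return project(T, S).isdisjoint(project(F, S))

def count_support_sets(T, F, n, K):
    """n(K): number of support sets of size K (brute-force enumeration)."""
    return sum(is_support_set(T, F, S) for S in combinations(range(n), K))

# Hypothetical toy pdBf on n = 3 attributes (0-based indices).
T = [(1, 0, 1), (1, 1, 0)]
F = [(0, 0, 1), (0, 1, 1)]
n1 = count_support_sets(T, F, 3, 1)
n2 = count_support_sets(T, F, 3, 2)
```

In this toy example attribute 0 alone separates T from F, so it is the only support set of size one.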



Theorem 1.

$$[\ldots]\,\frac{2^K - m_T}{2^K}\,[\ldots] \;<\; E[n(K)] \;<\; \min_{\delta > 0}\,\bigl\{[\ldots] + e\,[\ldots]\bigr\}$$



These bounds are reasonably close, and imply the following. 

Corollary 1. Suppose mp = o(2^^/^), mp = n > {{Key/ ^ + 1)K —1 

and $\ln\ln n = o(K)$. Then $E[n(K)] = 1$ implies
$K + \log_2 K = \log_2(m_T m_F)(1 + o(1))$, where $o(1) \to 0$ as $K \to \infty$.

In other words, support sets of size smaller than K, where
$K + \log_2 K = \log_2(m_T m_F)$, are rare in random data with $m_T m_F$
sufficiently large. Thus, when one exists, such a support set is more likely to be
related to the real phenomenon than to random noise.
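
To get a feeling for this threshold, the equation $K + \log_2 K = \log_2(m_T m_F)$ can be solved numerically. The sketch below uses plain bisection; the function name is hypothetical, and it assumes $m_T m_F \ge 2$ so that a solution with $K \ge 1$ exists.

```python
import math

def threshold_size(m_T, m_F, tol=1e-12):
    """Solve K + log2(K) = log2(m_T * m_F) for K by bisection (assumes m_T*m_F >= 2)."""
    target = math.log2(m_T * m_F)
    lo, hi = 1.0, max(2.0, target)      # g(lo) <= target <= g(hi), g increasing
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid + math.log2(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

K = threshold_size(100, 100)  # 100 positive and 100 negative examples
```

For 100 positive and 100 negative examples the threshold is slightly below 10, so under the random model support sets with clearly fewer than 10 attributes would be unexpected.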

For two vectors $a, b \in \mathbb{B}^n$, let $d(a, b) = |a \Delta b|$ denote the
Hamming distance between a and b, where $a \Delta b = \{j \mid a_j \ne b_j\}$. For a
subset $S \subseteq V$ let us consider its Hamming distance vector
$h(S) = (h_0, h_1, \ldots, h_n)$, where $h_k = h_k(S)$ is the number of pairs
$(a, b)$, $a \in T[S]$, $b \in F[S]$, exactly at Hamming distance k. Intuitively, S
is better separating than S' if $h(S) <_L h(S')$, where $<_L$ denotes the
lexicographic order^3. Defining
$\phi_\alpha(S) = \sum_{k=0}^{n} h_k(S)(1 - \alpha^k)$ for some $0 < \alpha < 1$,
the above inequality is equivalent to $\phi_\alpha(S) > \phi_\alpha(S')$, assuming
$\alpha$ is small enough. The special case
$\theta(S) = \phi_0(S) = |T||F| - h_0(S)$ appears frequently in the literature,
similarly to the minimum Hamming distance measure
$\rho(S) = \min_{a \in T,\, b \in F} d(a[S], b[S])$. Finally, let us consider the
following problem

$$\max \{\psi(S) \mid S \subseteq V,\ |S| \le K\} \qquad (\mathcal{P})$$

where $\psi$ can be any of $\rho$, $\theta$ or $\phi_\alpha$ ($0 < \alpha < 1$), and
K is a given threshold. Let us note that for any of the above measures, the smallest
parameter K for which $(\mathcal{P})$ has a support set as an optimal solution is
$K^*$, the size of the smallest possible support set for (T, F). Hence, these
optimization problems are NP-hard in general, since determining $K^*$ is known to
be NP-hard (see e.g. [15]).
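
The separation measures above can be computed directly from their definitions. The following sketch is illustrative only: it counts pairs over the original examples in T and F (consistent with $\theta(S) = |T||F| - h_0(S)$), truncates the distance vector at $|S|$ since larger distances cannot occur, and uses 0-based indices; all names are assumptions.

```python
def hamming_vector(T, F, S):
    """h[k] = number of pairs (a, b), a in T, b in F, at Hamming distance k on S."""
    h = [0] * (len(S) + 1)
    for a in T:
        for b in F:
            h[sum(a[j] != b[j] for j in S)] += 1
    return h

def phi(T, F, S, alpha):
    """phi_alpha(S) = sum_k h_k(S) * (1 - alpha**k)."""
    return sum(hk * (1 - alpha ** k) for k, hk in enumerate(hamming_vector(T, F, S)))

def theta(T, F, S):
    """theta(S) = |T||F| - h_0(S): number of pairs separated by S."""
    return len(T) * len(F) - hamming_vector(T, F, S)[0]

def rho(T, F, S):
    """rho(S): minimum Hamming distance on S between T and F (both non-empty)."""
    return next(k for k, hk in enumerate(hamming_vector(T, F, S)) if hk > 0)

# Hypothetical toy pdBf on n = 3 attributes.
T = [(1, 0, 1), (1, 1, 0)]
F = [(0, 0, 1), (0, 1, 1)]
h_all = hamming_vector(T, F, (0, 1, 2))
```

Since Python evaluates `0 ** 0` as 1, calling `phi` with `alpha=0` reproduces `theta`, matching the special case mentioned in the text.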

We first propose a standard greedy heuristic to solve problem $(\mathcal{P})$, of a
kind that is very common in discrete optimization.



$\psi$-GREEDY: Starting with $S \leftarrow \emptyset$, iteratively select
$j \in V \setminus S$ for which $\psi(S \cup \{j\})$ is the largest, and set
$S \leftarrow S \cup \{j\}$, until $T[S] \cap F[S] = \emptyset$.
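
A minimal sketch of the greedy scheme, using $\theta$ as the default measure, might look as follows. It is not the (modified) variant analyzed in the paper; ties are broken by the smallest index, indices are 0-based, and the function names are assumptions.

```python
def theta(T, F, S):
    """Number of pairs (a, b), a in T, b in F, that differ on some attribute in S."""
    return sum(any(a[j] != b[j] for j in S) for a in T for b in F)

def greedy_support_set(T, F, n, measure=theta):
    """Add the attribute maximizing the measure until T[S] and F[S] are disjoint."""
    S = []
    # T[S] and F[S] are disjoint exactly when every pair is separated by S.
    while theta(T, F, S) < len(T) * len(F):
        rest = [j for j in range(n) if j not in S]
        if not rest:
            raise ValueError("T and F share a vector; no support set exists")
        best = max(rest, key=lambda j: measure(T, F, S + [j]))
        S.append(best)
    return S

# Hypothetical toy data: attribute 0 alone separates T from F.
T = [(1, 0, 1), (1, 1, 0)]
F = [(0, 0, 1), (0, 1, 1)]
S = greedy_support_set(T, F, 3)
```

On an XOR-like data set no single attribute separates the classes, so the loop keeps adding attributes until both are selected.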



We can show that $\psi$-GREEDY has a guaranteed worst-case performance rate.

^3 It is said that $(h_0, \ldots, h_n) <_L (h'_0, \ldots, h'_n)$ if $h_i < h'_i$,
where i is the smallest index j with $h_j \ne h'_j$.





136 E. Boros et al. 



Theorem 2. Problem $(\mathcal{P})$ is approximable within a factor of $(1 - 1/e)$
by the (modified) $\psi$-GREEDY, for $\psi = \phi_\alpha$, $1 > \alpha \ge 0$
(including $\theta = \phi_0$).

We can also demonstrate that a similar result is very unlikely for the case of
$\psi = \rho$, unless P=NP; see [9].

Let us denote by $U = [0, 1]$ the unit interval, and for $y \in U^n$ and
$a, b \in \mathbb{B}^n$, let $d_y(a, b) = \sum_{j \in a \Delta b} y_j$. Clearly,
$d_y$ can be considered a natural extension of the Hamming distance. Consequently,
the measures of separation defined above can also be extended by defining
$\rho(y) = \min_{a \in T,\, b \in F} d_y(a, b)$,
$\theta(y) = \sum_{a \in T,\, b \in F} \min\{d_y(a, b), 1\}$, and
$\phi_\alpha(y) = \sum_{a \in T,\, b \in F} (1 - \alpha^{d_y(a, b)})$ for
$0 < \alpha < 1$.

Using these notations, the continuous relaxation of $(\mathcal{P})$ for a given
pdBf (T, F) can be written as

$$\max \Bigl\{\psi(y) \Bigm| y \in U^n,\ \sum_{j=1}^{n} y_j \le K\Bigr\} \qquad (\mathcal{P}^c)$$

It can easily be seen that problem $(\mathcal{P}^c)$ becomes a linear programming
problem if $\psi \in \{\rho, \theta\}$, and hence these can be solved efficiently,
even for large data sets. If $\psi(y) = \phi_\alpha(y)$ for some $0 < \alpha < 1$,
problem $(\mathcal{P}^c)$ is a concave maximization over a convex domain, which
again is known to be solvable efficiently^4.
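
The extended measures can be evaluated directly for any weight vector y. The sketch below only evaluates them (it does not solve the relaxation); function names are assumptions. For a 0/1 vector y the values coincide with the discrete measures of the corresponding attribute subset.

```python
def d_y(y, a, b):
    """Weighted Hamming distance: sum of y_j over the coordinates where a and b differ."""
    return sum(yj for yj, aj, bj in zip(y, a, b) if aj != bj)

def rho_c(y, T, F):
    return min(d_y(y, a, b) for a in T for b in F)

def theta_c(y, T, F):
    return sum(min(d_y(y, a, b), 1.0) for a in T for b in F)

def phi_c(y, T, F, alpha):
    return sum(1.0 - alpha ** d_y(y, a, b) for a in T for b in F)

# Hypothetical toy pdBf on n = 3 attributes.
T = [(1, 0, 1), (1, 1, 0)]
F = [(0, 0, 1), (0, 1, 1)]
binary_y = [1, 0, 0]          # indicator vector of S = {first attribute}
fractional_y = [0.5, 0.5, 0]  # a fractional point with the same total weight
```

Here the binary point separates every pair with distance one, while the fractional point with the same total weight leaves two pairs at distance 0.5.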

Another heuristic to solve problem $(\mathcal{P})$, the stingy algorithm, is based
on $(\mathcal{P}^c)$. It can be viewed as a reverse greedy algorithm, as it starts
from S = V and then removes elements from S successively until a minimal support
set is obtained.



$\psi$-STINGY: Starting with $K \leftarrow 1$ and $Z \leftarrow \emptyset$,
successively solve the optimization problem $(\mathcal{P}^c)$ extended with the
constraints $d_y(a, b) \ge 1$ for all $a \in T$ and $b \in F$, and $y_j = 0$ for
$j \in Z$; if this problem has no feasible solution, then set $K \leftarrow K + 1$;
otherwise, let $y^* \in U^n$ be an optimal solution, set
$z^* \leftarrow \min_{a \in T,\, b \in F} d_{y^*}(a, b)$, and let k be the largest
integer for which [...], where $y^*_{i_1} \le y^*_{i_2} \le \cdots \le y^*_{i_n}$;
if $k \le |Z|$, then set $K \leftarrow K + 1$, otherwise set
$Z \leftarrow \{i_1, i_2, \ldots, i_k\}$; until $K + |Z| \ge n$.



For the stingy algorithm we can show that 

Theorem 3. Algorithm $\psi$-STINGY terminates in polynomial time, and returns
a binary vector $y^*$ such that the set $S^* = \{j \mid y^*_j = 1\}$ is a minimal
support set of the pdBf (T, F), assuming that problem $(\mathcal{P}^c)$ is solvable
in polynomial time.

^4 For a concave maximization over a convex domain an $\epsilon > 0$ approximation
can be obtained in polynomial time in the input size and in $1/\epsilon$; see e.g.
[23].

3 Computational results

In our experiments, we generate pdBf-s with known support sets, and compare
the performance of the above heuristic algorithms on the basis of how well
they can recover the original support sets. We partition the variable space as
$V = H \cup D \cup R$, where H denotes the active variables of the “hidden logic”
f, D denotes variables depending on the components in H (via Boolean functions
$g_k : \mathbb{B}^H \to \mathbb{B}$, $k \in D$) and R denotes randomly generated
components. Then we draw vectors $x \in \mathbb{B}^V$ randomly, set
$x_k = g_k(x[H])$ for $k \in D$, and put x into T or F, according to the value
$f(x[H])$.
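
The generation scheme can be sketched as follows. The coordinate layout (H first, then D, then R), the function name and the toy choices of f and g are assumptions made for illustration; the paper does not specify which hidden functions were used.

```python
import random

def generate_pdbf(n_H, n_R, m, f, g, seed=0):
    """Draw m vectors; coordinates are laid out as H, then D, then R."""
    rng = random.Random(seed)
    T, F = set(), set()
    for _ in range(m):
        x_H = tuple(rng.randint(0, 1) for _ in range(n_H))
        x_D = tuple(g_k(x_H) for g_k in g)          # dependent on H only
        x_R = tuple(rng.randint(0, 1) for _ in range(n_R))
        x = x_H + x_D + x_R
        (T if f(x_H) == 1 else F).add(x)            # class decided by hidden logic
    return sorted(T), sorted(F)

# Hypothetical hidden logic: XOR of the first two active variables.
f = lambda x_H: x_H[0] ^ x_H[1]
g = [lambda x_H: x_H[0] & x_H[1]]                   # one dependent variable
T, F = generate_pdbf(n_H=3, n_R=2, m=20, f=f, g=g)
```

By construction the class of each vector depends only on its H coordinates, so H is always a support set of the generated pdBf, and duplicates may make $|T| + |F|$ smaller than m.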

We have generated three different types of problems: type A with $|H| = 10$,
$|D| = |R| = 5$, type B with $|H| = 5$, $|D| = |R| = 10$, and type C with
$|H| = 5$, $|D| = 0$ and $|R| = 40$. For each type we have generated 40 instances
in each of two different sizes: size 1 with $|T| + |F| = 200$ and size 2 with
$|T| + |F| = 400$. Table 1 lists the averages (of 40 runs) for the 3 best
performing heuristics for each of the above six categories. We can see that
$\phi_\alpha$-STINGY finds the smallest support sets, and recovers the most of the
hidden essential attributes, even if there are many dependent attributes present.
In particular, in all C2 instances the hidden support set was recovered perfectly.
On the other hand, the greedy algorithms also perform reasonably well, and run much
faster. In our implementation, greedy runs took only fractions of a second, while
the stingy algorithm ran for several minutes. We can also see that irrelevant
attributes are much easier to filter out than dependent ones (which can arguably
be expected).



Types  θ-GREEDY                      φ_α-GREEDY                    φ_α-STINGY
       |S|    r_H    r_D    r_R     |S|    r_H    r_D    r_R     |S|    r_H    r_D    r_R
A1     11.18  0.602  0.131  0.267   11.18  0.615  0.109  0.271   10.75  0.740  0.038  0.222
A2     12.14  0.756  0.076  0.168   12.12  0.771  0.062  0.167   10.78  0.913  0.010  0.078
B1     6.14   0.567  0.400  0.032   6.10   0.595  0.377  0.02?   5.80   0.704  0.223  0.073
B2     6.01   0.591  0.392  0.018   6.00   0.596  0.389  0.0?    5.70   0.730  0.216  0.053
C1     5.84   0.866  0.000  0.134   5.78   0.885  0.000  0.11?   5.67   0.923  0.000  0.077
C2     5.24   0.976  0.000  0.024   5.22   0.976  0.000  0.02?   5.00   1.000  0.000  0.000

Table 1. Performance summary for random functions: α = 0.001, r_H = |S ∩ H|/|S|,
r_D = |S ∩ D|/|S|, r_R = |S ∩ R|/|S|.

Abstract Reduct finding, especially optimal reduct finding, is similar to the
feature selection problem and is a crucial task in rough set applications to data
mining. In this paper, we propose a heuristic reduct finding algorithm based on the
frequencies of attributes appearing in the discernibility matrix. Our method is not
guaranteed to find an optimal reduct, but experiments show that in most situations
it does, and it is very fast.

Keywords rough set, reduct, discernibility matrix, data mining 

1 Introduction 

The rough set theory provides a formal framework for data mining. It has several
favorable features, such as representing knowledge in a clear mathematical manner,
deriving rules only from facts present in the data, and reducing information
systems to their simplest form.

The reduct is the most important concept in rough set applications to data mining.
A reduct is a minimal attribute set that preserves the classification accuracy of
the full attribute set of the original dataset. Finding a reduct is similar to the
feature selection problem. All reducts of a dataset can be found by constructing a
kind of discernibility function from the dataset and simplifying it [2].
Unfortunately, it has been shown that finding a minimal reduct and finding all
reducts are both NP-hard problems. Some heuristic algorithms have been proposed.
Hu gives an algorithm using the significance of attributes as the heuristic [4].
Some algorithms based on genetic algorithms have also been proposed. Starzyk uses
strong equivalence to simplify the discernibility function [3]. However, there is
no universal solution; it is still an open problem in rough set theory.

In this paper, we propose a simple but useful heuristic reduct algorithm using the
discernibility matrix. The algorithm is based on the frequencies of attributes
appearing in the discernibility matrix. Our method is not guaranteed to find an
optimal reduct, but experiments show that in most situations it does, and its
complexity is lower than the known bound for finding one reduct (see Section 5).

2 Related rough set concepts 

This section recalls the necessary rough set notions used in the paper. A detailed
description of the theory can be found in [2].

Definition 1 (Information system) An information system is an ordered pair
$S = (U, A \cup \{d\})$, where U is a non-empty, finite set called the universe, A
is a non-empty, finite set of conditional attributes, and d is a decision
attribute, with $A \cap \{d\} = \emptyset$. The elements of the universe are called
objects or instances.

An information system contains knowledge about a set of objects in terms of a
predefined set of attributes. A set of objects is called a concept in rough set
theory. In order to represent or approximate these concepts, an equivalence
relation is defined. The equivalence classes of the equivalence relation, which are
the minimal blocks of the information system, can be used to approximate these
concepts. Concepts that can be constructed from these blocks are called definable
sets. For undefinable sets, two definable sets, the upper approximation and the
lower approximation, are constructed to approximate the concept.

Definition 2 (Indiscernibility relation) Let $S = (U, A \cup \{d\})$ be an
information system. Every subset $B \subseteq A$ defines an equivalence relation
IND(B), called an indiscernibility relation, defined as
$IND(B) = \{(x, y) \in U \times U : a(x) = a(y) \text{ for every } a \in B\}$.

Definition 3 (Positive region) Given an information system
$S = (U, A \cup \{d\})$, let $X \subseteq U$ be a set of objects and
$B \subseteq A$ a selected set of attributes. The lower approximation of X with
respect to B is $B_*(X) = \{x \in U : [x]_B \subseteq X\}$. The upper approximation
of X with respect to B is $B^*(X) = \{x \in U : [x]_B \cap X \ne \emptyset\}$. The
positive region of decision d with respect to B is
$POS_B(d) = \bigcup \{B_*(X) : X \in U/IND(d)\}$.

The positive region of the decision attribute with respect to B represents the
approximation quality of B. Not all attributes are necessary to preserve the
approximation quality of the original information system. A reduct is a minimal set
of attributes preserving the approximation quality.

Definition 4 (Reduct) An attribute a is dispensable in $B \subseteq A$ if
$POS_B(d) = POS_{B - \{a\}}(d)$. A reduct of B is a set of attributes
$B' \subseteq B$ such that all attributes $a \in B - B'$ are dispensable, and
$POS_B(d) = POS_{B'}(d)$.

There are usually many reducts in an information system. In fact, one can show
that the number of reducts of an information system may be up to
$\binom{|A|}{\lfloor |A|/2 \rfloor}$. In order to find reducts, the discernibility
matrix and the discernibility function are introduced.

Definition 5 (Discernibility matrix) The discernibility matrix of an information
system is a symmetric $|U| \times |U|$ matrix with entries $c_{ij}$ defined as
$\{a \in A \mid a(x_i) \ne a(x_j)\}$ if $d(x_i) \ne d(x_j)$, and $\emptyset$
otherwise. A discernibility function can be constructed from the discernibility
matrix by or-ing all attributes in each $c_{ij}$ and then and-ing all of the
resulting clauses together. After simplifying the discernibility function using the
absorption law, the set of all prime implicants determines the set of all reducts
of the information system.

However, simplifying the discernibility function for reducts is an NP-hard problem.
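
A minimal sketch of how the non-empty part of the discernibility matrix could be computed is given below. The representation (objects as tuples of attribute values, attributes identified by 0-based position, only the upper triangle stored) is an assumption of this sketch and not the paper's C++ implementation.

```python
def discernibility_entries(objects, decisions):
    """Entries c_ij (i < j) for all pairs of objects with different decisions."""
    entries = []
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            if decisions[i] != decisions[j]:
                c = frozenset(
                    a for a, (u, v) in enumerate(zip(objects[i], objects[j])) if u != v
                )
                entries.append(c)
    return entries

# Hypothetical toy information system with two conditional attributes.
objects = [(0, 0), (0, 1), (1, 1)]
decisions = [0, 1, 1]
entries = discernibility_entries(objects, decisions)
```

Pairs of objects with the same decision produce no entry. If the data set is inconsistent (identical objects with different decisions), an empty entry appears.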

3 The principle 

The heuristic comes from the fact that the intersection of a reduct with every
entry of the discernibility matrix cannot be empty. If the intersection of some
entry $c_{ij}$ with a reduct were empty, objects i and j would be indiscernible
with respect to the reduct. This contradicts the definition of a reduct as a
minimal attribute set discerning all objects (assuming the dataset is consistent).

A straightforward algorithm can be constructed based on this heuristic. Let the
candidate reduct set be $R = \emptyset$. We examine every entry $c_{ij}$ of the
discernibility matrix. If the intersection of $c_{ij}$ and R is empty, a random
attribute from $c_{ij}$ is picked and inserted into R; otherwise the entry is
skipped. The procedure is repeated until all entries of the discernibility matrix
have been examined. The reduct is then in R.

The algorithm is simple and straightforward. However, most of the time what we
get is not a reduct itself but a superset of a reduct. For example, suppose there
are three entries in the matrix: $\{a_1, a_3\}$, $\{a_2, a_3\}$, $\{a_3\}$.
Following the algorithm, we may get the set $\{a_1, a_2, a_3\}$, although it is
obvious that $\{a_3\}$ is the only reduct. Why does that happen?

The answer is that our heuristic is a necessary but not sufficient condition for a
reduct. A reduct must also be minimal, and the above algorithm does not take this
into account. In order to find a reduct, and especially a shorter reduct, most of
the time, we need more heuristics.

A simple yet powerful method is to sort the discernibility matrix according to
$|c_{ij}|$. As we know, if there is only one element in $c_{ij}$, it must be a
member of every reduct. We can imagine that attributes in shorter and more frequent
entries contribute more classification power to the reduct. After sorting, we can
pick the more powerful attributes first, avoid situations like the example
mentioned above, and more likely get an optimal or sub-optimal reduct.

The sort procedure is as follows. First, all identical entries in the
discernibility matrix are merged and their frequencies are recorded. Then the
matrix is sorted according to the length of every entry. If two entries have the
same length, the more frequent entry takes precedence.

When generating the discernibility matrix, the frequency of every individual
attribute is also counted for later use. The frequencies are used to choose an
attribute whenever one attribute has to be picked from some entry and inserted into
the reduct. The idea is that a more frequent attribute is more likely to be a
member of the reduct. The counting process is weighted: attributes appearing in
shorter entries get a higher weight. When a new entry c is computed, the
frequencies f(a) of the corresponding attributes are updated as
$f(a) = f(a) + |A|/|c|$ for every $a \in c$, where $|A|$ is the total number of
attributes of the information system. For example, let $f(a_1) = 3$, $f(a_3) = 4$,
let the system have 10 attributes in total, and let the new entry be
$\{a_1, a_3\}$. Then the frequencies after this entry are
$f(a_1) = 3 + 10/2 = 8$ and $f(a_3) = 4 + 10/2 = 9$.
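
The weighted update can be reproduced in a few lines. The dictionary representation and the function name are assumptions of this sketch; the numbers are those of the example above.

```python
def update_counts(f, entry, n_attributes):
    """Add |A| / |c| to the frequency of every attribute in the new entry c."""
    for a in entry:
        f[a] = f.get(a, 0) + n_attributes / len(entry)
    return f

f = {"a1": 3, "a3": 4}
update_counts(f, {"a1", "a3"}, n_attributes=10)
```

After the call, `f` holds 8.0 for `a1` and 9.0 for `a3`.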

Empirical results presented in a later section show that our algorithm can find an
optimal reduct most of the time, and that it is very fast once the discernibility
matrix is computed.

4 The Algorithm

This section presents the algorithm written in pseudo-code. The algorithm is
designed according to the principle given in the previous section.

Input: an information system (U, A ∪ {d}), where A = ∪{a_i}, i = 1, ..., n.
Output: a reduct Red

Red = ∅, count(a_i) = 0, for i = 1, ..., n.
Generate discernibility matrix M and count frequency of every attribute count(a_i);
Merge and sort discernibility matrix M;
For every entry m in M do
    If (m ∩ Red == ∅)
        select attribute a with maximal count(a) in m
        Red = Red ∪ {a}
    Endif
EndFor
Return Red

Figure 1 A heuristic reduct algorithm

In line 2, when a new entry c of M is computed, count(a_i) is updated:
count(a_i) := count(a_i) + n/|c| for every $a_i \in c$. In line 3, identical
entries are merged and M is sorted according to the length and frequency of every
entry. Lines 4-9 traverse M and generate the reduct.
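
The procedure of Figure 1 can be sketched in Python as below. This is an illustrative reading of the pseudo-code rather than the authors' C++ implementation: it takes the discernibility entries as input, drops empty entries, and breaks ties in the sort and in the attribute selection arbitrarily.

```python
from collections import Counter

def heuristic_reduct(entries, n_attributes):
    """Frequency-based heuristic over the entries of a discernibility matrix."""
    entries = [frozenset(c) for c in entries if c]     # ignore empty entries
    count = Counter()
    for c in entries:                                  # weighted attribute frequencies
        for a in c:
            count[a] += n_attributes / len(c)
    freq = Counter(entries)                            # merge identical entries
    ordered = sorted(freq, key=lambda c: (len(c), -freq[c]))  # short, frequent first
    red = set()
    for c in ordered:
        if not (c & red):                              # entry not yet covered
            red.add(max(c, key=lambda a: count[a]))
    return red

# The three-entry example from Section 3, with attributes numbered 1..3.
red = heuristic_reduct([{1, 3}, {2, 3}, {3}], n_attributes=3)
```

On the three-entry example of Section 3 the singleton entry is processed first, so the sketch returns the single attribute 3 instead of the superset produced by the naive procedure.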

5 Implementation and complexity issues 

In order to save space, every entry of the discernibility matrix is implemented as
a bit vector whose length is equal to $|A|$. Every attribute is represented by one
bit in the bit vector. A bit is set to 1 if the corresponding attribute is present
in the entry, and to 0 otherwise. Step 2 and step 3 are performed simultaneously in
the implementation. Our implementation is based on the standard template library
(STL) in C++.
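
The bit-vector representation can be illustrated with Python integers used as bit sets. This is only a sketch of the idea, not the STL-based implementation mentioned above, and the helper names are assumptions.

```python
def to_bits(entry):
    """Encode a set of attribute indices as an integer bit vector."""
    bits = 0
    for a in entry:
        bits |= 1 << a
    return bits

def intersects(x, y):
    """True if two bit-vector entries share at least one attribute."""
    return (x & y) != 0

def length(x):
    """Number of attributes present in a bit-vector entry."""
    return bin(x).count("1")

entry = to_bits({0, 3})   # attributes 0 and 3 present
```

With this encoding the test "m ∩ Red = ∅" in Figure 1 becomes a single bitwise AND.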

At step 2, the cost of computing the discernibility matrix is $O(|A||U|^2)$. In the
worst case, there will be $|U|(|U|-1)/2$ entries in M, thus the sorting procedure
in step 3 takes at most $O(|U|^2 \log(|U|^2)) = O(2|U|^2 \log|U|)$. In fact, there
are far fewer entries than in the worst case, because objects in the same class do
not produce any entry.

Steps 4-9 traverse M and generate the reduct. With at most $|U|(|U|-1)/2$ entries
in M and $|A|$ items in each entry, the worst-case time complexity is also
$O(|A||U|^2)$, though in practical applications it typically takes much less time
than step 2, since there are usually only a few entries left in M after step 3. So
the total cost is at most $O((|A| + \log|U|)|U|^2)$, which is less than the
complexity bound $O(|A|^2|U|^2)$ for finding a reduct in [1]. Empirical analysis
also shows that our algorithm is very fast (see the next section).

6 Empirical Analysis 

The algorithm was tested on a personal computer running Windows 98 with a
Pentium II 266 processor and 64 MB of memory.

We have tested our algorithm on 45 UCI datasets [7]. All datasets were discretized
using the MLC utility [6]. In order to find out whether our algorithm could find an
optimal reduct, we computed all reducts using the Boolean reasoning methods
described in [2] for reference. Note that when we talk about an optimal reduct we
refer to the shortest reduct.

Twenty of the 45 datasets have only one reduct. Our algorithm finds them
successfully. These datasets are adult, australian, balance-scale, cars, cleve,
diabetes, german-org, glass, heart, iris, led7, lenses, letter, monk1, monk2,
monk3, parity5+5, pima, solar, and vehicle.

For three datasets, the set of all reducts could not be determined. The program ran
for several hours with no sign of terminating, so we stopped it. We therefore do
not know whether our algorithm finds the optimal reduct for them. These datasets
are led24, dna, and satimage.



Results for the remaining datasets are summarized in Figure 2.



Total 25    Yes: 16    sub: 4    super: 2    unknown: 3





Figure 2 Summary of results of our algorithm



The leftmost column contains the dataset names. The 2nd, 3rd and 4th columns are
the numbers of instances, attributes and attribute values of the corresponding
dataset. The 5th column is the number of all reducts. The rightmost column is the
computing time used by our algorithm. The 6th column gives the result of our
algorithm. The number in brackets is the length of the reduct our algorithm found.
A "yes" before it indicates that what we found is an optimal reduct. A "superXX"
before it indicates that what we found is a superset of a reduct, where XX is the
length of the optimal reduct. A "subXX" indicates that what we found is a reduct,
but not a shortest one, where XX is the length of the optimal reduct. An "unknown"
before it indicates that the set of all reducts could not be determined.

For example, for the dataset auto, the all-reducts algorithm finds the following
reducts (where the attribute names are coded as integers):

0 1 4 5 9 10 11 23; 0 1 4 5 9 10 11 14; 0 1 4 5 9 10 11 13; 0 1 4 5 8 10 11;
0 1 4 5 6 10 11 23 24; 0 1 4 5 6 10 11 20 23; 0 1 4 5 6 10 11 17 23;
0 1 4 5 6 10 11 16 23; 0 1 4 5 6 10 11 15 23; 0 1 4 5 6 10 11 14 24;
0 1 4 5 6 10 11 14 20; 0 1 4 5 6 10 11 14 17; 0 1 4 5 6 10 11 14 16;
0 1 4 5 6 10 11 14 15; 0 1 4 5 6 10 11 12 23; 0 1 4 5 6 10 11 12 14;
0 1 4 5 6 10 11 12 13.

Our algorithm finds the optimal reduct: 0 1 4 5 8 10 11.

For the dataset german, a super-reduct is obtained. All reducts are:
0 2 3 5 6 11 13 14 16 18; 0 1 3 5 6 8 9 11 13 16 18 19; 0 1 2 3 5 6 9 11 13 16 18;
0 1 2 3 4 5 6 8 11 13 14 16; 0 1 2 3 4 5 6 8 9 11 13 16. Our solution is
0 2 3 5 6 8 11 13 14 16 18.

For the dataset breast, a sub-optimal reduct is found. All reducts are:
1 3 4 5 6 7 8; 1 2 4 5 6 8. Our algorithm has found the first one.

From Figure 2, we can see that our algorithm found an optimal reduct in most cases.
Even in the non-optimal cases, our algorithm finds a satisfactory sub-optimal
solution. And it is fast.

7 Conclusion 

In this paper we propose an efficient heuristic algorithm for finding an optimal
reduct. The algorithm makes use of frequency information of individual attributes
in the discernibility matrix, and develops a weighting mechanism to rank
attributes. The method is not guaranteed to find an optimal reduct, but experiments
show that in most situations it does.

Further research directions include extending our algorithm to an incremental
version, developing a more efficient weighting mechanism, etc.

Abstract. Backpropagation is often used as the learning algorithm in
layered-structure neural networks because of its efficiency. However,
backpropagation is not free from problems. The learning process sometimes
gets trapped in a local minimum, and the network cannot produce
the required response. In addition, the algorithm has a number of parameters,
such as the learning rate (μ), momentum factor (α) and steepness
parameter (λ), whose values are not known in advance and must be
determined by trial and error. The appropriate selection of these parameters
has a large effect on the convergence of the algorithm. Many
techniques that adaptively adjust these parameters have been developed
to increase the speed of convergence. A class of recently developed
algorithms uses learning automata (LA) for adjusting the parameters
μ, α, and λ based on the observation of the random response of the neural
network. One important aspect of learning automata based schemes is
their remarkable effectiveness in increasing the speed of convergence.
Another important aspect, which has not been pointed out earlier, is their
ability to escape from local minima with high probability during the
training period. In this report we study the ability of LA based schemes to
escape from local minima when standard BP fails to find the global minimum.
It is demonstrated through simulation that LA based schemes have a higher
ability to escape from local minima than other schemes such as SAB,
SuperSAB, Fuzzy BP, the ASBP method, and the VLR method.



1 Introduction 

The multilayer feedforward neural network models with the error back-propagation
(BP) algorithm have been widely researched and applied [1]. Despite the many
successful applications of backpropagation, it has many drawbacks. For complex
problems it may require a long time to train the networks, and it may not train
at all. It has been pointed out by numerous researchers that BP can be trapped in
local minima during gradient descent, and in many of these cases it seems very
unlikely that any learning algorithm could perform satisfactorily in terms of
computational requirements. Long training time can be the result of non-optimal
values for the parameters of the training algorithm. It is not easy to choose
appropriate values for these parameters for a particular problem. The parameters
are usually determined by trial and error and by using past experience. For
example, if the learning rate is too small, convergence can be very slow; if it is
too large, paralysis and continuous instability can result. Moreover, the best
value at the beginning of training may not be so good later. Thus several
researchers have suggested algorithms for automatically adjusting the parameters of
the training algorithm as training proceeds, such as the algorithms proposed by
Arabshahi et al. [2], Darken and Moody [3], Jacobs [3], and Sperduti and Starita
[4], to mention a few. Several learning automata (LA) based procedures have been
developed recently [5][6][7]. In these methods, variable structure learning
automata (VSLA) or fixed structure learning automata (FSLA) are used to find
appropriate values of the parameters for the BP training algorithm. In these
schemes, either a separate learning automaton is associated with each layer or each
neuron of the network, or a single automaton is associated with the whole network
to adapt the appropriate parameters. Computer simulations show that a learning rate
adapted in such a way increases the rate of convergence of the network by a large
amount.

When we use learning automata as the adaptation technique for BP parameters,
the search for the optimum is carried out in probability space rather than in
parameter space, as is the case with other adaptation algorithms. In the
standard gradient method, the new operating point lies within a neighborhood
of the previous point. This is not the case for adaptation algorithms
based on stochastic principles such as learning automata: the new operating
point is determined by a probability function and is therefore not constrained
to be near the previous operating point. This gives the algorithm a higher
ability to locate the global minimum. In this paper we study the ability of LA
based schemes to escape from local minima when standard BP fails to find the
global minimum. It is demonstrated through simulation that LA based schemes,
compared to other schemes such as SAB [3], SuperSAB [3], the adaptive steepness
method (ASBP) [4], the variable learning rate (VLR) method [8] and Fuzzy BP [2],
have a higher ability to escape from local minima; that is, BP parameter
adaptation using an LA based scheme increases the likelihood of bypassing a
local minimum.

The rest of the paper is organized as follows. Section 2 briefly presents the
basics of learning automata. Existing LA based adaptation schemes for BP
parameters are described in section 3. Section 4 demonstrates through
simulations the ability of LA based schemes to escape from local minima. The
last section concludes the paper.

2 Learning Automata 

Learning automata operating in unknown random environments have been used
as models of learning systems. These automata choose an action at each instant
from a finite action set, observe the reaction of the environment to the action
chosen, and modify the selection of the next action on the basis of that reaction.
The selected action serves as the input to the environment, which in turn emits a
stochastic response. The environment penalizes the automaton with the penalty
$c_i$, which is action dependent. On the basis of the response of the
environment, the state of the automaton is updated and a new action is chosen at
the next time instant. Note that the $\{c_i\}$ are unknown initially, and it is
desired that, as a result of interaction with the environment, the automaton
arrives at the action which presents it with the minimum penalty response in an
expected sense. If the probabilities of transition from one state to another and
the probabilities of correspondence between actions and states are fixed, the
automaton is called a fixed-structure automaton; otherwise it is called a
variable-structure automaton. Examples of the FSLA type that we use in this paper
are the Tsetline, Krinsky, TsetlineG, and Krylov automata. For more information
on learning automata refer to [9].
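
To make the notion of a fixed-structure automaton concrete, the following is a minimal Python sketch of a two-action automaton of the Tsetline type. It is an illustration only and is not taken from [9] or from the schemes discussed in this paper: the class name, the `depth` memory parameter, the state numbering, and the penalty probabilities used in `simulate` are all assumptions made for the example.

```python
import random


class TsetlinAutomaton:
    """Two-action fixed-structure automaton with `depth` states per action."""

    def __init__(self, depth):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self.depth = depth
        self.action = 0   # currently selected action (0 or 1)
        self.level = 1    # 1 = boundary state, depth = most confident state

    def update(self, penalized):
        """Move one state according to the environment response."""
        if not penalized:
            # Favorable response: move deeper into the current action.
            self.level = min(self.level + 1, self.depth)
        elif self.level > 1:
            # Unfavorable response: move toward the boundary state.
            self.level -= 1
        else:
            # Unfavorable response at the boundary: switch to the other action.
            self.action = 1 - self.action
        return self.action


def simulate(penalty_probs, depth, steps, seed=0):
    """Count how often each action is chosen in a stationary random environment.

    penalty_probs[i] plays the role of the (hypothetical) penalty c_i of action i.
    """
    rng = random.Random(seed)
    automaton = TsetlinAutomaton(depth)
    counts = [0, 0]
    for _ in range(steps):
        action = automaton.action
        counts[action] += 1
        automaton.update(rng.random() < penalty_probs[action])
    return counts


if __name__ == "__main__":
    print(simulate([0.8, 0.2], depth=4, steps=2000))
```

The transition rule is fixed in advance and only the state changes with experience, which is what distinguishes this type from a variable-structure automaton, where the action probabilities themselves are updated.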

3 LA Based Schemes For Adaptation of BP Parameters 

In this section, we briefly describe the LA based schemes for adaptation of BP
parameters [5] [6] [7]. In all of the existing schemes, one or more automata
are associated with the network. The learning automata, based on the observation
of the random response of the neural network, adapt one or more of the BP
parameters. The interconnection of the learning automata and the neural network
is shown in figure 1. Note that the neural network is the environment for the
learning automata. According to the amount of error received from the neural
network, the learning automata adjust the parameters of the BP algorithm. The
actions of the automata correspond to the values of the parameters being adapted,
and the input to the automata is some function of the error in the output of the
neural network.




[Figure 1: the learning automata supply the value of the parameter being adapted
to the neural network.]

Fig. 1. Automata-neural network connection



Existing LA based procedures for adaptation of BP parameters can be classified
into four groups, which we call groups A, B, C, and D. In group
A schemes, one automaton is used for the whole network, whereas in group B
schemes, separate automata, one for each layer (hidden and output), are used [5].
Groups A and B can each be classified into two sub-groups, depending on the type
of automata used (fixed or variable structure). The parameter adapted
by a group A scheme is used by all the links or neurons of the network,
and therefore these schemes fall in the category of global parameter adaptation
methods, whereas group B schemes, which adapt the parameter for each layer
independently, may be referred to as quasi-global parameter adaptation methods.

In a class C scheme one automaton is associated with each link of the network
to adjust the parameter for that link, and in a class D scheme one automaton is
associated with each neuron of the network to adjust the parameter for that neuron.
Class C and D schemes may be referred to as local parameter adaptation
methods. In [6] class C schemes are used for adaptation of the learning rate and
class D schemes are used for adaptation of the steepness parameter. In class C
and D schemes, the automaton receives a favorable response from the environment
if the algebraic sign of the derivative is the same in two consecutive iterations,
and receives an unfavorable response if the algebraic sign of the derivative
alternates in two consecutive iterations.
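
The response rule for class C and D schemes described above can be sketched in Python as follows. This is a hedged illustration, not the implementation of [6]: the function names are hypothetical, and treating a zero derivative as unfavorable is an assumption made here because the text does not specify that case.

```python
def environment_response(prev_grad, curr_grad):
    """Return 0 (favorable) if the derivative kept its sign, else 1 (unfavorable).

    A product of exactly zero is treated as unfavorable; this is an assumption.
    """
    return 0 if prev_grad * curr_grad > 0 else 1


def responses(gradients):
    """Responses for a sequence of derivatives of one link or neuron."""
    return [
        environment_response(prev, curr)
        for prev, curr in zip(gradients, gradients[1:])
    ]


if __name__ == "__main__":
    # The derivative changes sign once, between the second and third iteration.
    print(responses([0.5, 0.2, -0.1, -0.3]))
```

Each response would then be passed to the automaton attached to that link or neuron, which in turn selects the next value of the parameter.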

For convenience of presentation, we use the following naming
conventions to refer to the different LA based schemes in classes A, B, C, and D.
Without loss of generality, we assume that in class A and class B the neural
network has one hidden layer.

Automata - AX($\gamma$): A scheme in class A for adjusting parameter $\gamma$ which
uses the X structure LA Automata.

Automata1 - Automata2 - BX($\gamma$): A scheme in class B which uses the X structure
LA Automata1 for the hidden layer and the X structure LA Automata2 for the output
layer.

Automata - CX($\gamma$): A scheme in class C for adjusting parameter $\gamma$ which
uses the X structure LA Automata.

Automata - DX($\gamma$): A scheme in class D for adjusting parameter $\gamma$ which
uses the X structure LA Automata.

The rate of convergence can be improved if both the learning rate and the steepness
parameter are adapted simultaneously. Simultaneous use of class C and class D
schemes for adaptation of the learning rate and steepness parameters is also
reported in [6]. An LA based scheme that simultaneously adapts the learning rate
and steepness parameter is denoted by Automata1 - Automata2 - CDX($\mu$, $\lambda$)
if X structure LA is used, and an LA based scheme which simultaneously adapts the
learning rate and momentum factor is denoted by Automata1 - Automata2 -
CX($\mu$, $\alpha$) when X structure LA is used. X denotes either fixed or variable
structure automata. For all the LA based schemes reported in the literature, it is
shown through simulation that the use of LA for adaptation of BP learning
algorithm parameters increases the rate of convergence by a large amount [6].

4 LA Based Schemes and Local Minima 

In this section, we examine the ability of the LA based schemes to escape from
local minima. For this purpose, we chose a problem in which local minima
occur frequently [10]. This example considers a sigmoidal network for the
XOR boolean function with the quadratic cost function and the standard learning
environment. The training set of this problem is given in table 1.

Table 1. Training set for given problem

The network used has two input nodes x and y, two hidden units,
and one output unit. In this problem, if the hidden units produce the lines a and
b a local minimum has occurred, and if the hidden units produce the lines c
and d the global minimum has occurred [10]. Figure 2 shows these configurations.
The error surface of the network as a function of the weights $w_{2,1,1}$ and
$w_{1,1,1}$ is given in figure 3.




Fig. 2. Lines produced by hidden units 



Depending on the initial weights, gradient descent can get stuck at points where
the error is far from zero. The presence of these local minima is intuitively
related to the symmetry of the learning environment. Experimental evidence of
the presence of local minima is given in figure 3.

In order to show how well the LA based adaptation algorithms escape local
minima, we test twelve different LA based algorithms, 5 from class A, 4 from
class B, 1 from class C, 1 from class D, and 1 from class CD, and compare
their results with standard BP and five other known adaptation methods:
SAB, SuperSAB, the VLR method, the ASBP method, and fuzzy BP. The results of
the simulation for 20 runs are summarized in table 2. Note that for standard BP,
and also for standard BP when the SAB or SuperSAB method is used to adapt the
learning rate, none of the 20 runs converges to the global minimum. Among the
non-LA based methods the ASBP method performs best. For this scheme 7
out of 20 runs converge to the global minimum, which is comparable to some of
the LA based schemes we have tested. The best result is obtained for algorithm
Tsetline - AF($\lambda$), for which 13 out of 20 runs converge to the global minimum.
The next best result belongs to the J - J - CDF($\mu$, $\lambda$) scheme. In the
simulations, the initial weights of the network are chosen in such a way that the
network starts from a point in the close neighborhood of a local minimum.

Fig. 3. Error surface as a function of weights $w_{2,1,1}$ and $w_{1,1,1}$

The reason for such good performance of LA based schemes is that in the
standard gradient method, the new operating point lies within a neighborhood
of the previous point. This is not the case for adaptation algorithms
based on stochastic principles, as the new operating point is determined by a
probability function and is therefore not constrained to be near the previous
operating point. This gives the algorithm a higher ability to locate the global
optimum. In general, the LA approach has two distinct advantages over classical
hill climbing methods: 1) the parameter space need not be metric, and 2) since the
search is conducted in the path probability space rather than the parameter space,
a global rather than a local optimum can be found.



5 Conclusion 

In this report we studied the ability of LA based schemes to escape from local
minima when standard BP fails to find the global minimum. It is demonstrated
through simulation that LA based schemes, compared to other schemes such
as SAB, SuperSAB, Fuzzy BP, the ASBP method, and the VLR method, have a higher
ability to escape from local minima. It must be mentioned that just as BP cannot
guarantee convergence to the global minimum, neither can LA based
schemes. This is a problem inherent to localized optimization techniques such
as steepest descent, of which backpropagation is a special case.
Table 2. Results of 20 simulation runs: number of runs that did not converge and
that converged to the global minimum

Algorithm                              Class   Not converged   Converged
BP                                             20              0
SAB                                            20              0
SuperSAB                                       20              0
VLR                                            18              2
Fuzzy BP                                       18              2
ASBP                                           13              7
Tsetline - AF($\mu$)                   A       18              2
Krinsky - AF($\mu$)                    A       14              6
Krylov - AF($\mu$)                     A       17              3
$L_{R-P}$ - AF($\mu$)                  A       16              4
Tsetline - AF($\lambda$)               A       7               13
Tsetline - TsetlineG - BF($\mu$)       B       18              2
Tsetline - Krylov - BF($\mu$)          B       18              2
Tsetline - Krinsky - BF($\mu$)         B       15              5
Tsetline - Tsetline - BF($\mu$)        B       15              5
J - DF($\lambda$)                      D       15              5
J - CF($\mu$)                          C       13              7
J - J - CDF($\mu$, $\lambda$)          CD      8               12
Abstract. Generalization of the covariance concept is discussed for mixed
categorical and numerical data. Gini’s definition of variance for categorical data 
gives us a starting point to address this issue. The value difference in the original 
definition is changed to a vector in value space, giving a new definition of 
covariance for categorical and numerical data. It leads to reasonable correlation 
coefficients when applied to typical contingency tables. 



1 Introduction 

Covariances and correlation coefficients for numerical data express the strength of a 
correlation between a pair of variables. Such convenient measures have been expected 
for categorical data, and there have been many proposals to define the strength of a 
correlation [1]. However, none of these proposals has succeeded in unifying the 
correlation concept for numerical and categorical data. 

Recently, variance and sum of squares concepts for a single categorical variable 
were shown to give a reasonable measure of the rule strength in data mining [2]. If we 
can introduce a covariance definition for numerical and categorical variables, more 
flexible data mining schemes could be formulated. 

In this paper we propose a generalized and unified formulation for the covariance 
concept. Section 2 introduces Gini’s definition of variance, and its limitations. A new 
definition of covariance is proposed in Section 3. Samples of covariance matrices and 
correlation coefficients are shown for typical contingency tables in Section 4. 
2 Gini’s Definition of Variance and its Limitations



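Gini successfully defined the variance for categorical data [3]. He first showed that the
following equality holds for the variance of a numerical variable $x_i$,

$$V_{ii} = \frac{1}{n}\sum_{a}\left(x_{ia} - \bar{x}_i\right)^2 = \frac{1}{2n^2}\sum_{a}\sum_{b}\left(x_{ia} - x_{ib}\right)^2 , \qquad (1)$$

where $V_{ii}$ is the variance of the i-th variable, $x_{ia}$ is the value of $x_i$ for the
a-th instance, $\bar{x}_i$ is the mean of $x_i$, and n is the number of instances.

Then, he gave a simple distance definition (2) for a pair of categorical values. The
variance defined for categorical data was easily transformed to the expression at the
right end of (3).

$$x_{ia} - x_{ib} = \begin{cases} 1 & \text{if } x_{ia} \neq x_{ib} \\ 0 & \text{if } x_{ia} = x_{ib} \end{cases} \qquad (2)$$

$$V_{ii} = \frac{1}{2n^2}\sum_{a}\sum_{b}\left(x_{ia} - x_{ib}\right)^2 = \frac{1}{2}\left(1 - \sum_{r} p_i(r)^2\right) \qquad (3)$$

Here $p_i(r)$ is the probability that the variable $x_i$ takes a value r. The resulting expression
is the well-known Gini-index.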
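
As a quick numerical check of the equality between the pairwise form and the Gini-index form of the categorical variance, a small Python sketch is given below. The function names and the sample data are hypothetical and serve only as an illustration.

```python
from collections import Counter


def pairwise_variance(values):
    """Variance as 1/(2 n^2) times the sum of squared pairwise distances,
    with distance 1 for different categories and 0 for equal ones."""
    n = len(values)
    differing = sum(1 for a in values for b in values if a != b)
    return differing / (2 * n * n)


def gini_variance(values):
    """Variance as (1/2) * (1 - sum over r of p(r)^2)."""
    n = len(values)
    return 0.5 * (1 - sum((count / n) ** 2 for count in Counter(values).values()))


if __name__ == "__main__":
    sample = ["r", "r", "s", "t"]
    print(pairwise_variance(sample), gini_variance(sample))
```

Both functions return the same number for any list of categorical values, since each ordered pair of differing instances contributes exactly 1 to the double sum.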

The above definition can be extended to covariances by changing $(x_{ia} - x_{ib})^2$ to
$(x_{ia} - x_{ib})(x_{ja} - x_{jb})$ [4]. However, it does not give reasonable values relative
to correlation coefficients. The difficulty can be seen in the contingency table example
of Table 1. There are two variables, $x_i$ and $x_j$, each of which takes three values.
Almost all instances appear in the diagonal positions, and hence the data should have a
high $V_{ij}$. The problem arises when we consider an instance at (t, v). Intuitively, this
instance should decrease the strength of the correlation. However, there appears to be
some positive contribution to $V_{ij}$ between this instance and that at (r, u). It comes
from the value difference pair, ($x_i$: r/t, $x_j$: u/v), which is different from the major
value difference pairs ($x_i$: r/s, $x_j$: u/v), ($x_i$: r/t, $x_j$: u/w) and
($x_i$: s/t, $x_j$: v/w). This contradiction comes from (2)
in that it does not discriminate between these four
types of value difference pairs.



Table 1. A sample contingency 
table with high correlation. 
3 Generalized Covariance

We propose a scheme to generalize the definition of a covariance for categorical data. It 
employs Gini’s variance definition (3) as the starting point, and introduces two 
additional concepts. The first is to represent the value difference as a vector in value 
space. The other is to regard the covariance as the extent of maximum overlap between 
vectors in two value spaces. 

3.1 Vector Expression of a Value Difference 

We employ a vector expression, instead of the distance, $x_{ia} - x_{ib}$, in the variance
definition. When $x_i$ is a numerical variable, the expression is a vector in
one-dimensional space. The absolute value and sign of $(x_{ib} - x_{ia})$ give its length and
direction, respectively.

Now let us think of a categorical variable, $x_i$, that can take three values, (r, s, t). We
can position these values at the three vertices of an equilateral triangle as shown in
Figure 1. Then, a value difference is a vector in
two-dimensional space. The length of every edge is set to 1
to adapt to the distance definition of (2). If there are c kinds of
values for a categorical variable, $x_i$, then each value can be
matched to a vertex of the regular polyhedron in
(c-1)-dimensional space.
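
One convenient way to obtain such vertices numerically is sketched below in Python. It is an illustration under an assumption that differs slightly from the text: the c vertices are written with c coordinates (scaled unit vectors), so they lie in a (c-1)-dimensional hyperplane rather than being expressed directly in (c-1) coordinates. The function names are hypothetical.

```python
import math
from itertools import combinations


def simplex_vertices(c):
    """Return c points with mutual distance 1, one per categorical value.

    Vertex i is the i-th unit vector scaled by 1/sqrt(2), so any two
    vertices differ in exactly two coordinates by 1/sqrt(2) each.
    """
    scale = 1 / math.sqrt(2)
    return [[scale if k == i else 0.0 for k in range(c)] for i in range(c)]


def difference(p, q):
    """Value difference vector pointing from vertex p to vertex q."""
    return [b - a for a, b in zip(p, q)]


def edge_lengths(c):
    """Lengths of all value difference vectors between distinct values."""
    vertices = simplex_vertices(c)
    return [
        math.sqrt(sum(d * d for d in difference(p, q)))
        for p, q in combinations(vertices, 2)
    ]


if __name__ == "__main__":
    print(edge_lengths(3))
```

Every edge has unit length, which matches the distance definition (2) for any number of categorical values.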

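3.2 Definition of Covariance, $V_{ij}$

Our proposal for the $V_{ij}$ definition is the maximum value of $Q_{ij}(L)$ while changing L,
and $Q_{ij}(L)$ is defined by the subsequent formula,

$$V_{ij} = \max_{L}\left(Q_{ij}(L)\right) , \qquad (4)$$

$$Q_{ij}(L) = \frac{1}{2n^2}\sum_{a}\sum_{b}\left\langle x_{ia}x_{ib} \,\middle|\, L \,\middle|\, x_{ja}x_{jb} \right\rangle . \qquad (5)$$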




A Note on Covariances for Categorical Data 153 



Here, L is an orthogonal transformation applicable to the value space. The bracket
notation, $\langle e|L|f \rangle$, is evaluated as the scalar product of two vectors e and Lf
(or $L^{-1}e$ and f). If the lengths of the two vectors, e and f, are not equal, zeros are
first padded to the vector of the shorter length.

In general, L may be selected from any orthogonal transformation, but we impose
some restrictions in the following cases.

1. When we compute the variance, $V_{ii}$, L must be the identity transformation, since two
value difference vectors are in the identical space.

2. A possible transformation of L is (1) or (-1) when the vector lengths of e and f are
unity. However, if both $x_i$ and $x_j$ are numerical variables, we always have to use the
transformation matrix, (1), in order to express a negative correlation.



3.3 Assumed Properties for Bracket Notations 

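We assume several properties when using bracket notation, as follows. All these
properties are easily understood as properties of a vector.

$$\langle rr|L|uv \rangle = \langle rs|L|uu \rangle = 0.0 , \qquad (6)$$

$$\langle \lambda\, rs|L|uv \rangle = \langle rs|L|\lambda\, uv \rangle = \lambda \langle rs|L|uv \rangle , \qquad (7)$$

$$\langle rs|L|uv \rangle = -\langle rs|L|vu \rangle = -\langle sr|L|uv \rangle = \langle sr|L|vu \rangle , \qquad (8)$$

$$\langle rs|L|uv \rangle + \langle st|L|uv \rangle = \langle rt|L|uv \rangle , \qquad (9)$$

$$\langle rs|L|uv \rangle + \langle rs|L|vw \rangle = \langle rs|L|uw \rangle . \qquad (10)$$

Furthermore, we can assume (11) without loss of generality, while (6) and (11) are
alternative definitions for the original distance definition (2).

$$\langle rs|rs \rangle = 1.0 . \qquad (11)$$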
4 Samples of Covariance Matrices

There is no way to prove the proposed covariance definition. Covariance matrices are 
derived for typical contingency tables to facilitate the understanding of our proposal. 



4.1 2x2 Categorical Data 





            u               v
r       $n_{ru}$        $n_{rv}$        $n_{r\cdot}$
s       $n_{su}$        $n_{sv}$        $n_{s\cdot}$
        $n_{\cdot u}$   $n_{\cdot v}$   $n$

Our first example is a simple 2x2 contingency table, shown above, where $n_{r\cdot}$
and $n_{\cdot u}$ represent marginal distributions. The straightforward application of (5)
to this table gives the following expressions for $Q_{ii}$.



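$$Q_{ii}(L) = \frac{1}{2n^2}\left( n_{r\cdot}^2 \langle rr|L|rr \rangle + n_{r\cdot}n_{s\cdot} \langle rs|L|rs \rangle + n_{s\cdot}n_{r\cdot} \langle sr|L|sr \rangle + n_{s\cdot}^2 \langle ss|L|ss \rangle \right)$$

$$\qquad\quad = \frac{1}{2n^2}\, 2\, n_{r\cdot} n_{s\cdot} \langle rs|L|rs \rangle = \frac{n_{r\cdot} n_{s\cdot}}{n^2} . \qquad (12)$$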

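The resulting expression does not depend on L. $V_{ii}$ and $V_{jj}$ are given by (13) and (14).
These expressions are identical to those of Gini.

$$V_{ii} = n_{r\cdot} n_{s\cdot} / n^2 = \tfrac{1}{2}\left(1 - (n_{r\cdot}/n)^2 - (n_{s\cdot}/n)^2\right) , \qquad (13)$$

$$V_{jj} = n_{\cdot u} n_{\cdot v} / n^2 = \tfrac{1}{2}\left(1 - (n_{\cdot u}/n)^2 - (n_{\cdot v}/n)^2\right) . \qquad (14)$$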

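The same procedure gives the following expression for $Q_{ij}$, where (8) is used to derive
the second line.

$$Q_{ij}(L) = \frac{1}{2n^2}\left( n_{ru}n_{sv} \langle rs|L|uv \rangle + n_{rv}n_{su} \langle rs|L|vu \rangle + n_{su}n_{rv} \langle sr|L|uv \rangle + n_{sv}n_{ru} \langle sr|L|vu \rangle \right)$$

$$\qquad\quad = \frac{1}{n^2}\left( n_{ru}n_{sv} - n_{rv}n_{su} \right) \langle rs|L|uv \rangle . \qquad (15)$$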



Here, rs and uv are expressed by a vector with one element. The transformation matrix
L can take the value (1) or (-1), a 1x1 matrix. Therefore, the expression of $V_{ij}$ is given
by the next formula.

$$V_{ij} = \left| n_{ru}n_{sv} - n_{rv}n_{su} \right| / n^2 . \qquad (16)$$

The numerator of this formula is the critical term used to represent the extent of
dependency between two variables. In fact, the correlation coefficient, $R_{ij}$, defined
by $V_{ij}/\sqrt{V_{ii}V_{jj}}$, is 1.0 (0.0) for completely dependent (independent) data, respectively.
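
The behaviour of the correlation coefficient for 2x2 tables can be checked with the short Python sketch below. It rests on two assumptions that are inferred rather than quoted: that $V_{ij} = |n_{ru}n_{sv} - n_{rv}n_{su}|/n^2$, which follows from (15) when L is restricted to (1) or (-1), and that the correlation coefficient is $V_{ij}/\sqrt{V_{ii}V_{jj}}$. The function name is hypothetical.

```python
import math


def correlation_2x2(n_ru, n_rv, n_su, n_sv):
    """Correlation coefficient of a 2x2 contingency table.

    Rows are the values r, s of x_i; columns are the values u, v of x_j.
    All four marginal counts must be positive, otherwise a variance is zero.
    """
    n = n_ru + n_rv + n_su + n_sv
    n_r, n_s = n_ru + n_rv, n_su + n_sv      # row marginals
    n_u, n_v = n_ru + n_su, n_rv + n_sv      # column marginals
    v_ii = n_r * n_s / n ** 2                # as in (13)
    v_jj = n_u * n_v / n ** 2                # as in (14)
    v_ij = abs(n_ru * n_sv - n_rv * n_su) / n ** 2
    return v_ij / math.sqrt(v_ii * v_jj)


if __name__ == "__main__":
    print(correlation_2x2(5, 0, 0, 5))   # completely dependent table
    print(correlation_2x2(2, 4, 3, 6))   # rows proportional: independent table
```

Under these assumptions the value coincides with the absolute value of the usual phi coefficient of a 2x2 table.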

4.2 2x3 Categorical Data 





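            u               v               w
r       $n_{ru}$        $n_{rv}$        $n_{rw}$        $n_{r\cdot}$
s       $n_{su}$        $n_{sv}$        $n_{sw}$        $n_{s\cdot}$
        $n_{\cdot u}$   $n_{\cdot v}$   $n_{\cdot w}$   $n$

$V_{jj}$ takes the following expression,

$$V_{jj} = \tfrac{1}{2}\left(1 - (n_{\cdot u}/n)^2 - (n_{\cdot v}/n)^2 - (n_{\cdot w}/n)^2\right) . \qquad (17)$$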

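Without loss of generality, the value difference vectors
and the transformation, L, can be written as

$$rs = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad uv = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad uw = \begin{pmatrix} 1/2 \\ \sqrt{3}/2 \end{pmatrix}, \quad vw = \begin{pmatrix} -1/2 \\ \sqrt{3}/2 \end{pmatrix}, \qquad (18)$$

$$L(\theta, \sigma) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}^{\sigma} , \qquad (19)$$

where the value of $\sigma$ is 0 or 1, and the range of $\theta$ is from $-\pi$ to $\pi$. Using the
upper (lower) sign for $\sigma$ = 1 (0), $Q_{ij}(L)$ is given by

$$Q_{ij}(L) = \frac{1}{n^2}\Big\{ (n_{ru}n_{sv} - n_{rv}n_{su}) \langle rs|L|uv \rangle + (n_{ru}n_{sw} - n_{rw}n_{su}) \langle rs|L|uw \rangle + (n_{rw}n_{sv} - n_{rv}n_{sw}) \langle rs|L|wv \rangle \Big\}$$

$$\qquad = \cos\theta \Big\{ (n_{ru}n_{sv} - n_{rv}n_{su}) + (\tfrac{1}{2})(n_{ru}n_{sw} - n_{rw}n_{su}) + (\tfrac{1}{2})(n_{rw}n_{sv} - n_{rv}n_{sw}) \Big\} / n^2$$

$$\qquad\quad + \sin\theta \Big\{ 0 + (\mp\tfrac{\sqrt{3}}{2})(n_{ru}n_{sw} - n_{rw}n_{su}) + (\pm\tfrac{\sqrt{3}}{2})(n_{rw}n_{sv} - n_{rv}n_{sw}) \Big\} / n^2 . \qquad (20)$$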



This expression, (20), reduces to (15) if $n_{rw}$ and $n_{sw}$ are zero. Here, we examine two
specific contingency tables, (i) and (ii). The resulting correlation coefficients seem to
have reasonable values in these cases. Shown at the right side are the value spaces of $x_i$
and $x_j$, while (u, v, w) and (u', v', w') show vertices before and after the transformation,
L, respectively. In the first case, the vector r→s shows maximum overlap with (u'→w'
+ v'→w'), and it corresponds to the observation from the contingency table. The same
sort of overlap is found for (u'→v' + w'→v') in the second case.

(i)                     $x_j$
                u       v       w
$x_i$   r       n/3     n/3     0
        s       0       0       n/3

$V_{ii} = 2/9$, $V_{jj} = 1/3$,
$V_{ij} = \sqrt{3}/9$ when $\theta = \pi/2$, $\sigma = 0$,
$R_{ij} = 0.707$.

(ii)                    $x_j$
                u       v       w
$x_i$   r       n/6     0       n/6
        s       n/6     n/3     n/6

$V_{ii} = 2/9$, $V_{jj} = 1/3$,
$V_{ij} = \sqrt{3}/18$ when $\theta = -\pi/6$, $\sigma = 0$,
$R_{ij} = 0.354$.



4.3 3x3 Categorical Data 
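            u               v               w
r       $n_{ru}$        $n_{rv}$        $n_{rw}$        $n_{r\cdot}$
s       $n_{su}$        $n_{sv}$        $n_{sw}$        $n_{s\cdot}$
t       $n_{tu}$        $n_{tv}$        $n_{tw}$        $n_{t\cdot}$
        $n_{\cdot u}$   $n_{\cdot v}$   $n_{\cdot w}$   $n$

$V_{ii}$ and $V_{jj}$ take the same expression as that of (17).
(19) is used as the transformation, L, and the value
difference vectors are set to the following forms.

$$rs = uv = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad rt = uw = \begin{pmatrix} 1/2 \\ \sqrt{3}/2 \end{pmatrix}, \quad st = vw = \begin{pmatrix} -1/2 \\ \sqrt{3}/2 \end{pmatrix}. \qquad (21)$$

Then, the following formula gives the expression for $Q_{ij}$.

$$\begin{aligned}
Q_{ij}(L) = \frac{\cos\theta}{n^2} \Big[ & (n_{ru}n_{sv} - n_{rv}n_{su}) + (\tfrac{1}{2})(n_{ru}n_{sw} - n_{rw}n_{su}) + (-\tfrac{1}{2})(n_{rv}n_{sw} - n_{rw}n_{sv}) \\
& + (\tfrac{1}{2})(n_{ru}n_{tv} - n_{rv}n_{tu}) + (\tfrac{1}{4} \pm \tfrac{3}{4})(n_{ru}n_{tw} - n_{rw}n_{tu}) + (-\tfrac{1}{4} \pm \tfrac{3}{4})(n_{rv}n_{tw} - n_{rw}n_{tv}) \\
& + (-\tfrac{1}{2})(n_{su}n_{tv} - n_{sv}n_{tu}) + (-\tfrac{1}{4} \pm \tfrac{3}{4})(n_{su}n_{tw} - n_{sw}n_{tu}) + (\tfrac{1}{4} \pm \tfrac{3}{4})(n_{sv}n_{tw} - n_{sw}n_{tv}) \Big] \\
+ \frac{\sin\theta}{n^2} \Big[ & 0 + (\pm\tfrac{\sqrt{3}}{2})(n_{ru}n_{sw} - n_{rw}n_{su}) + (\pm\tfrac{\sqrt{3}}{2})(n_{rv}n_{sw} - n_{rw}n_{sv}) \\
& + (-\tfrac{\sqrt{3}}{2})(n_{ru}n_{tv} - n_{rv}n_{tu}) + (\pm\tfrac{\sqrt{3}}{4} - \tfrac{\sqrt{3}}{4})(n_{ru}n_{tw} - n_{rw}n_{tu}) + (\pm\tfrac{\sqrt{3}}{4} + \tfrac{\sqrt{3}}{4})(n_{rv}n_{tw} - n_{rw}n_{tv}) \\
& + (-\tfrac{\sqrt{3}}{2})(n_{su}n_{tv} - n_{sv}n_{tu}) + (\mp\tfrac{\sqrt{3}}{4} - \tfrac{\sqrt{3}}{4})(n_{su}n_{tw} - n_{sw}n_{tu}) + (\mp\tfrac{\sqrt{3}}{4} + \tfrac{\sqrt{3}}{4})(n_{sv}n_{tw} - n_{sw}n_{tv}) \Big] \qquad (22)
\end{aligned}$$

where upper and lower signs correspond to $\sigma$ = 0 and 1, respectively.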

When we are concerned with the contingency table shown below, we get $R_{ij} = 1.0$
as expected. The resulting transformation, $\theta = 0$ and $\sigma = 0$, indicates the complete
correspondence between two value sets, (r/u, s/v, t/w). The exchange of category names
does not affect the $R_{ij}$ value. That is, even if we move $n_2$ to cell(t, v) and $n_3$ to
cell(s, w), the $R_{ij}$ value remains equal to unity. In this case, the effect of the value
exchange appears in the resulting values ($\theta = -\pi/6$ and $\sigma = 1$).

            u           v           w
r       $n_1$       0           0
s       0           $n_2$       0
t       0           0           $n_3$

$V_{ii} = V_{jj} = (n_1 n_2 + n_2 n_3 + n_3 n_1)/(n_1 + n_2 + n_3)^2$,
$V_{ij} = V_{ii}$ when $\theta = 0$, $\sigma = 0$,
$R_{ij} = 1.0$.



5 Conclusion 

We proposed a new definition for the variance-covariance matrix that is equally 
applicable to numerical, categorical and mixed data. Calculations on sample 
contingency tables yielded reasonable results. When applied to numerical data, the 
proposed scheme reduces to the conventional variance-covariance concept. When 
applied to categorical data, it covers Gini’s variance concept. 

This current work does not give an explicit algorithm to compute the 
variance-covariance matrix. Furthermore, we do not discuss the statistical distribution 
of the sample variance and correlation coefficients. Nevertheless, this work is expected 
to open the door to a unified treatment for numerical and categorical data. 
Abstract. We review a recently proposed family of functions for finding
principal and minor components of a data set. We extend the family
so that the Principal Subspace of the data set is found by using a
method similar to that known as the Bigradient algorithm. We then
amend the method in a way which was shown to change a Principal
Component Analysis (PCA) rule to a rule for performing Factor Analysis
(FA) and show its power on a standard problem. We find in both
cases that, whereas the one Principal Component family all have similar
convergence and stability properties, the multiple output networks for
both PCA and FA have different properties.



1 Introduction 

Principal Component Analysis (PCA) is a well-known statistical technique for
finding the best linear compression of a data set. PCA uses the eigenvectors
and corresponding eigenvalues of the covariance matrix of a data set. Let $x =
\{x_1, ..., x_M\}$ be iid (independent, identically distributed) samples drawn from a
data source. If each $x_i$ is n-dimensional, there exist at most n eigenvalues/eigenvectors.
Let $\Sigma$ be the covariance matrix of the data set; then $\Sigma$ is n x n. Then the
eigenvectors, $e_i$, are n-dimensional vectors which are found by solving

$$\Sigma e_i = \lambda_i e_i \qquad (1)$$

where $\lambda_i$ is the eigenvalue corresponding to $e_i$. A second standard method is the
technique of Factor Analysis (FA). PCA and FA are closely related statistical
techniques both of which achieve an efficient compression of the data but in a
different manner. They can both be described as methods to explain the data
set in a smaller number of dimensions but FA is based on assumptions about
the nature of the underlying data whereas PCA is model free.

We can also view PCA as an attempt to find a transformation from the data 
set to a compressed code, whereas in FA we try to find the linear transformation 
which takes us from a set of hidden factors to the data set. Since PCA is model 
free, we make no assumptions about the form of the data’s covariance matrix.
However FA begins with a specific model which is usually constrained by our 
prior knowledge or assumptions about the data set. 

2 Artificial Neural Networks 

Since the seminal work of Oja [5, 8, 6, 7], a great number of unsupervised neural
networks have been shown to perform PCA. We investigate one biologically plausible
network: the data is fed forward from the input neurons (the x-values) to
the output neurons. Here the weighted summation of the activations is performed
and this is fed back via the same weights and used in the simple Hebbian learning
procedure. Consider a network with N-dimensional input data and having
M output neurons. Then the output of the $i^{th}$ output neuron is given by

$$y_i = act_i = \sum_{j=1}^{N} w_{ij} x_j \qquad (2)$$

where $x_j$ is the activation of the $j^{th}$ input neuron, $w_{ij}$ is the weight between
this and the $i^{th}$ output neuron and $act_i$ is the activation of the $i^{th}$ neuron. This
firing is fed back through the same weights as inhibition to give

$$x_j(t+1) \leftarrow x_j(t) - \sum_{k=1}^{M} w_{kj} y_k \qquad (3)$$

where we have used (t) and (t+1) to differentiate between activation at times t
and t+1. Now simple Hebbian learning between input and output neurons gives

$$\Delta w_{ij} = \eta_t y_i x_j(t+1) = \eta_t y_i \left( x_j(t) - \sum_{l=1}^{M} w_{lj} y_l \right)$$

where $\eta_t$ is the learning rate at time t. This network actually only finds the
subspace spanned by the Principal Components; we can find the actual Principal
Components by introducing some asymmetry into the network [3].
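
A single update of this negative feedback network can be sketched in plain Python as follows. This is a minimal illustration of the feedforward step (2), the feedback step (3) and the Hebbian update, not the authors' code; the function name, the list-of-lists weight layout and the example values are assumptions.

```python
def feedback_step(W, x, eta):
    """One update of the negative feedback network.

    W is an M x N list of weight rows, x an input of length N and eta the
    learning rate. Returns the new weights, the outputs y and the residual e.
    """
    M, N = len(W), len(x)
    # Feedforward activation, equation (2).
    y = [sum(W[i][j] * x[j] for j in range(N)) for i in range(M)]
    # Residual after feeding the outputs back as inhibition, equation (3).
    e = [x[j] - sum(W[k][j] * y[k] for k in range(M)) for j in range(N)]
    # Simple Hebbian learning between the residual and the outputs.
    W_new = [[W[i][j] + eta * y[i] * e[j] for j in range(N)] for i in range(M)]
    return W_new, y, e


if __name__ == "__main__":
    W = [[1.0, 0.0]]
    # The part of x already explained by the weights produces no update;
    # only the unexplained second component changes the weight vector.
    print(feedback_step(W, [1.0, 1.0], eta=0.1))
```

When the input lies entirely in the span of a unit-length weight vector the residual vanishes and the weights stay unchanged, which is why learning stops once the subspace has been captured.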

We have previously shown [4] that, by not allowing the weights in the above
network to become negative, i.e. enforcing the additional constraint $w_{ij} \geq 0$ in
our learning rules, our network's weights converge to identify the independent
sources of a data set exactly. We have recently shown that this rectification may
be used in general with PCA networks to create FA networks [1].

3 The new class of functions 

It has recently been shown [12] that solutions of the generalised eigenvalue problem

$$A w = \lambda B w \qquad (4)$$

can be found using gradient ascent of the form

$$\frac{dw}{dt} = A w - f(w) B w \qquad (5)$$

where the function $f(w) : R^n - \{0\} \rightarrow R$ satisfies

1. $f(w)$ is locally Lipschitz continuous

2. $\exists M_1 > M_2 > 0 : f(w) > \lambda_1, \forall w : \|w\| > M_1$ and $f(w) < \lambda_n, \forall w : 0 < \|w\| < M_2$

3. $\forall w \in R^n - \{0\}, \exists N_1 > N_2 > 0 : f(\theta w) > \lambda_1, \forall \theta : \theta > N_1$ and
$f(\theta w) < \lambda_n, \forall \theta : 0 < \theta < N_2$ and $f(\theta w)$ is a strictly monotonically
increasing function of $\theta$ in $[N_1, N_2]$.

with $\lambda_1$ the greatest eigenvalue and $\lambda_n$ the smallest. Taking $A = \Sigma = E(xx^T)$,
and $B = I$, we return to the standard eigenvector problem which may be used
to find the principal components of a data set. Since $y = w \cdot x$, we have the
instantaneous rule

$$\Delta w \propto x y - f(w) w$$

For example, if we choose $f(w) = \ln(w^T(t) w(t))$ [12], we have:

$$\Delta w_j = \eta \left( x_j y - \ln(w^T w)\, w_j \right) \qquad (6)$$

Similarly, we can use:

$$\Delta w_j = \eta \Big( x_j y - \ln\big(\sum_k |w_k|\big)\, w_j \Big) \qquad (7)$$

$$\Delta w_j = \eta \Big( x_j y - \ln\big(\max_{1 \leq k \leq n} |w_k|\big)\, w_j \Big) \qquad (8)$$

$$\Delta w_j = \eta \left( x_j y - (w^T w - \phi)\, w_j \right) \qquad (9)$$

$$\Delta w_j = \eta \Big( x_j y - \big(\max_{1 \leq k \leq n} |w_k| - \phi\big)\, w_j \Big) \qquad (10)$$

$$\Delta w_j = \eta \Big( x_j y - \big(\sum_k |w_k| - \phi\big)\, w_j \Big) \qquad (11)$$

The functions in (6)-(11) will be known as $f_1()$, ..., $f_6()$ in the following and all
provide iterative solutions to the maximisation of $J_{PCA}$ which may either be
defined in terms of minimisation of least mean square error or as best linear
maximisation of variance.
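
The behaviour of the single-output rule with $f(w) = \ln(w^T w)$ can be illustrated with the following Python sketch. It is a simplification made for the example, not the experiment reported here: instead of sampling data, it iterates the averaged rule $\Delta w = \eta(\Sigma w - \ln(w^T w)\, w)$ for a diagonal covariance matrix, and the function name, step size and iteration count are assumptions.

```python
import math


def averaged_rule(cov_diag, w, eta=0.01, steps=5000):
    """Iterate w <- w + eta * (A w - ln(w.w) w) for a diagonal covariance A.

    cov_diag holds the diagonal of A, so A w is an elementwise product.
    """
    w = list(w)
    for _ in range(steps):
        f = math.log(sum(v * v for v in w))          # f(w) = ln(w^T w)
        w = [v + eta * (a * v - f * v) for a, v in zip(cov_diag, w)]
    return w


if __name__ == "__main__":
    # Variances 1 and 2: the second coordinate is the principal direction.
    print(averaged_rule([1.0, 2.0], [0.5, 0.5]))
```

At a fixed point $\Sigma w = \ln(w^T w)\, w$, so w is an eigenvector whose squared length is the exponential of its eigenvalue; with the values above the component along the minor direction decays and the squared length approaches $e^2$.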

Now (9) is simply the rule for the Bigradient algorithm with a single output 
[11] and so we now discuss the Bigradient algorithm. 



4 Multiple Outputs 

Now the bigradient algorithm [11] for multiple outputs may be viewed as the
optimisation of three parts

1. The first part optimises the PCA criterion: e.g. it may be viewed as maximising
the variance of the outputs of the neural network. The instantaneous
optimisation is of

$$J_1 = \max_{w} (w \cdot x)^2 \qquad (12)$$

2. The second part ensures that the vector is of length 1. This constraint may
be thought of as optimising

$$J_2 = \min_{w} (1 - w \cdot w)^2 \qquad (13)$$

3. The third part ensures that each weight vector is orthogonal to every other
weight vector:

$$J_3 = \min_{w} \sum_{i \neq j} (w_i \cdot w_j)^2 \qquad (14)$$

This suggests turning the argument in the last section on its head. If each of
these rules using the derivatives of $J_1$ and either that of $J_2$ (the one neuron
bigradient algorithm) or any of the other equivalent functions, $f_i()$, finds the
first principal component, can we adjust each rule by inserting a term ensuring
that $J_3$ is also optimised so that all weights learn to respond to different inputs?
This gives us a learning rule

$$\Delta w_i = \eta \Big( x y_i - f(W)\, w_i - \sum_{k \neq i} (w_k \cdot w_i)\, w_i \Big) \qquad (15)$$

where we have used $w_i$ as the weight vector and W as the matrix of all weight
vectors.

To this end, we perform a series of simple experiments: we generate artificial
data such that $x_1$ is a sample from a Gaussian distribution with standard
deviation 1 (i.e. N(0,1)), $x_2$ is drawn from N(0,2) etc. so that the input with the
highest variance is clearly the last. In our experiments we chose 5 inputs and 3
outputs and began the simulation with a learning rate of 0.0001 which decreased
to 0 in the course of 100000 iterations. Typical results are shown in Table 1. If
the principal subspace spanned by the first three PCs is found, there will be
zeros in the first two positions in the weight vector (as for $f_4()$).

We see that there are three distinct groups of results:

1. The first three functions fail to find the subspace of the first three Principal
Components. Extensive experimentation with different learning rates, initial
conditions and numbers of iterations has always produced similar results.

2. The fourth function is exactly the usual bigradient algorithm. As expected it 
finds the subspace of the first three Principal Components (though it should 
be noted that the normalisation is far from secure). 

3. The fifth and sixth functions are the most interesting: they also find the 
subspace but more interestingly actually seem to be able to identify the 
actual Principal Components themselves. 

The first thing to point out is that the equivalence of the six functions in the 
one neuron case no longer exists in the multiple neuron case. The fifth and sixth 
functions are different in that they use the absolute value of the weights. 
        …     …     …     -0.2    2.9
w3      0     0     0      1.7    2.1

        Function $f_5()$                          Function $f_6()$
w1      0     0    -0.1   -6.2    0        w1    0     0    -0.1   -6.0    0
w2      0     0    -0.1    0      9.2      w2    0     0    -0.1    0      9.0
w3      0     0    -4.0    0.1    0        w3    0     0    -3.9    0.1    0

Table 1. Converged vectors for three outputs and five inputs using each of the functions
$f_1()$, ..., $f_6()$. The first three functions fail to find the Principal Subspace. The last three
find the Subspace, and the last two come close to finding the Principal Components
themselves.



5 Multiple Outputs and Non-negativity

This last finding is interesting to us in the light of our previous changes to a 
PCA network which transformed it into a Factor Analysis network by enforcing 
a positivity constraint on the weights. Thus we use the same learning rules as in 
Section 4 but impose the constraint that we cannot have negative factor loadings. 

The benchmark experiment for this problem is due to Földiák [2]. The input
data here consists of a square grid of input values where $x_i = 1$ if square $i$
is black and 0 otherwise. However, the patterns are not random patterns: each
input consists of a number of randomly chosen horizontal or vertical lines. The
important thing to note is that each line is an independent source of blackening
a pixel on the grid: it may be that a particular pixel will be twice blackened by
both a horizontal and a vertical line at the same time, but we need to identify
both of these sources. We again find that $f_6()$ in its multiple output (bigradient)
form is the best at finding the independent components, even outperforming the
original bigradient algorithm with the non-negativity constraint. We do however
have to loosen the force with which each weight vector repels its neighbours by using

$$\Delta \mathbf{w}_i = \eta \Big( \mathbf{x} y_i - f(W)\mathbf{w}_i - \gamma \sum_{k \neq i} (\mathbf{w}_k\cdot\mathbf{w}_i)\mathbf{w}_i \Big) \qquad (16)$$

where $\gamma < 1$.

6 Conclusion 



We have reviewed a new class of one neuron artificial neural networks which had 
previously been shown to find the first Principal Component of a data set. In the 
one neuron case, it has been stated that the members of this class of networks 
are all equivalent in terms of convergence and stability (this is readily confirmed
experimentally). We have extended this class of neural networks by amending 
the learning rule so that each neuron learns to respond to a different section of 
the data - the Principal Subspace is found - by using a method suggested by 
the bigradient algorithm. However now we see that the class of functions is no 
longer homogeneous in terms of its convergence; clearly the interaction between 
the criteria is having a differential effect. Interestingly two of the members of 
the class find the actual principal components of the data set. 

We have also implemented the rectification of the weights which transformed 
a PCA network to a FA network and again found that some functions find the 
underlying factors of the data set while others do not. 

Future work will investigate deflationary algorithms [10] and lateral inhibition 
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Abstract. Canonical Correlation Analysis [3] is used when we have two 
data sets which we believe have some underlying correlation. In this 
paper, we derive a new family of neural methods for finding the canonical 
correlation directions by solving a generalized eigenvalue problem. Based 
on the differential equation for the generalized eigenvalue problem, a 
family of CCA learning algorithms can be obtained. We compare our 
family of methods with a previously derived [2] CCA learning algorithm. 
Our results show that all the new learning algorithms of this family have 
the same order of convergence speed and in particular are much faster 
than existing algorithms; they are also shown to be able to find greater 
nonlinear correlations. They are also much more robust with respect to 
parameter selection. 



1 Canonical Correlation Analysis 

Canonical Correlation Analysis is a statistical technique used when we have two
data sets which we believe have some underlying correlation. Consider two sets
of input data, $\mathbf{x}_1$ and $\mathbf{x}_2$. Then in classical CCA, we attempt to find the linear
combinations of the variables which give us maximum correlation between the
combinations. Let

$$y_1 = \mathbf{w}_1 \mathbf{x}_1 = \sum_j w_{1j} x_{1j}$$

$$y_2 = \mathbf{w}_2 \mathbf{x}_2 = \sum_j w_{2j} x_{2j}$$

where we have used $x_{ij}$ as the $j$-th element of $\mathbf{x}_i$. Then we wish to find those
values of $\mathbf{w}_1$ and $\mathbf{w}_2$ which maximise the correlation between $y_1$ and $y_2$.
The standard statistical method (see [3]) lies in defining

$$\Sigma_{11} = E\{(\mathbf{x}_1 - \mu_1)(\mathbf{x}_1 - \mu_1)^T\}$$

$$\Sigma_{22} = E\{(\mathbf{x}_2 - \mu_2)(\mathbf{x}_2 - \mu_2)^T\}$$

$$\Sigma_{12} = E\{(\mathbf{x}_1 - \mu_1)(\mathbf{x}_2 - \mu_2)^T\}$$

$$\text{and} \quad K = \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2} \qquad (1)$$

K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 164-173, 2000. 
where $T$ denotes the transpose of a vector and $E\{\}$ denotes the expectation
operator. We then perform a Singular Value Decomposition of $K$ to get

$$K = (\alpha_1, \alpha_2, \ldots, \alpha_k) D (\beta_1, \beta_2, \ldots, \beta_k)^T \qquad (2)$$

where $\alpha_i$ and $\beta_i$ are the standardised eigenvectors of $KK^T$ and $K^TK$ respectively
and $D$ is the diagonal matrix of eigenvalues. Then the first canonical correlation
vectors (those which give greatest correlation) are given by

$$\mathbf{w}_1 = \Sigma_{11}^{-1/2} \alpha_1 \qquad (3)$$

$$\mathbf{w}_2 = \Sigma_{22}^{-1/2} \beta_1 \qquad (4)$$

with subsequent canonical correlation vectors defined in terms of the subsequent
eigenvectors, $\alpha_i$ and $\beta_i$.
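
The classical procedure of equations (1)-(4) can be sketched numerically as follows. This is only an illustration: the helper names `inv_sqrt` and `classical_cca`, the use of sample covariances, and the toy data (two vectors sharing one common source) are assumptions of the example.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def classical_cca(X1, X2):
    """Return (rho, w1, w2) for the first pair of canonical directions.

    X1 and X2 hold one sample per row.
    """
    X1c = X1 - X1.mean(axis=0)
    X2c = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    S11 = X1c.T @ X1c / n
    S22 = X2c.T @ X2c / n
    S12 = X1c.T @ X2c / n
    A, B = inv_sqrt(S11), inv_sqrt(S22)
    K = A @ S12 @ B                  # equation (1)
    U, d, Vt = np.linalg.svd(K)      # equation (2)
    w1 = A @ U[:, 0]                 # equation (3)
    w2 = B @ Vt[0, :]                # equation (4)
    return d[0], w1, w2

# Toy data: only the first elements of the two vectors are correlated.
rng = np.random.default_rng(0)
shared = rng.normal(size=5000)
X1 = rng.normal(size=(5000, 4))
X2 = rng.normal(size=(5000, 3))
X1[:, 0] = (X1[:, 0] + shared) / 2
X2[:, 0] = (X2[:, 0] + shared) / 2

rho, w1, w2 = classical_cca(X1, X2)
y1 = (X1 - X1.mean(axis=0)) @ w1
y2 = (X2 - X2.mean(axis=0)) @ w2
print(rho, np.corrcoef(y1, y2)[0, 1])
```

The largest singular value of $K$ coincides with the sample correlation of the two projected outputs, which is a convenient check on an implementation.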

2 A neural implementation 

A previous ‘neural implementation’ [2] of CCA was derived by phrasing the
problem as that of maximising

$$J = E\{(y_1 y_2) + \tfrac{1}{2}\lambda_1(1 - y_1^2) + \tfrac{1}{2}\lambda_2(1 - y_2^2)\}$$

where the $\lambda_i$ were motivated by the method of Lagrange multipliers to constrain
the weights to finite values. By taking the derivative of this function with respect
to both the weights, $\mathbf{w}_1$ and $\mathbf{w}_2$, and the Lagrange multipliers, $\lambda_1$ and $\lambda_2$, we
derive learning rules for both:

$$\Delta w_{1j} = \eta x_{1j}(y_2 - \lambda_1 y_1)$$

$$\Delta \lambda_1 = \eta_0 (1 - y_1^2)$$

$$\Delta w_{2j} = \eta x_{2j}(y_1 - \lambda_2 y_2)$$

$$\Delta \lambda_2 = \eta_0 (1 - y_2^2) \qquad (5)$$

where $w_{1j}$ is the $j$-th element of weight vector $\mathbf{w}_1$, etc. If we consider the general
problem of maximising correlations between two data sets which may have
an underlying nonlinear relationship, we can use some nonlinear function, for
example tanh(), to train the output neurons. So, the outputs $y_3$ and $y_4$ can be
calculated from:



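$$y_3 = \sum_j w_{3j} \tanh(v_{3j} x_{1j}) = \mathbf{w}_3 \mathbf{g}_3 \qquad (6)$$

$$y_4 = \sum_j w_{4j} \tanh(v_{4j} x_{2j}) = \mathbf{w}_4 \mathbf{g}_4 \qquad (7)$$

The weights $\mathbf{v}_3$ and $\mathbf{v}_4$ are used to optimise the nonlinearity, which gives us
extra flexibility in maximising correlations. The maximum correlation between
$y_3$ and $y_4$ was found by maximising the function:

$$J = E\{(y_3 y_4) + \tfrac{1}{2}\lambda_3(1 - y_3^2) + \tfrac{1}{2}\lambda_4(1 - y_4^2)\}$$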
We may now use the derivative of this function with respect to the weights, $\mathbf{w}_3$
and $\mathbf{w}_4$, the Lagrange multipliers, $\lambda_3$ and $\lambda_4$, and also the weights $\mathbf{v}_3$ and $\mathbf{v}_4$, to
derive learning rules for both:

$$\Delta w_{3i} = \eta g_{3i}(y_4 - \lambda_3 y_3)$$

$$\Delta v_{3i} = \eta x_{1i} w_{3i}(y_4 - \lambda_3 y_3)(1 - g_{3i}^2)$$

$$\Delta w_{4i} = \eta g_{4i}(y_3 - \lambda_4 y_4)$$

$$\Delta v_{4i} = \eta x_{2i} w_{4i}(y_3 - \lambda_4 y_4)(1 - g_{4i}^2) \qquad (8)$$

We note that both the linear and the nonlinear rules use a two-phase approach
- the $\lambda$ parameter updates interleave with the updates of the weights. In the next
section, we will derive a much simplified learning rule and show that the new
learning rule outperforms this set of rules.



3 CCA and the Generalised Eigenvalue Problem 



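Now it may be shown [1] that an alternative method of finding the canonical
correlation directions is to solve the generalised eigenvalue problem

$$\begin{pmatrix} 0 & \Sigma_{12} \\ \Sigma_{21} & 0 \end{pmatrix} \begin{pmatrix} \mathbf{w}_1 \\ \mathbf{w}_2 \end{pmatrix} = \rho \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix} \begin{pmatrix} \mathbf{w}_1 \\ \mathbf{w}_2 \end{pmatrix} \qquad (9)$$

where $\rho$ is the correlation coefficient. Intuitively, since $\Sigma_{ij} = E(\mathbf{x}_i \mathbf{x}_j^T)$, we are
stating that $\mathbf{w}_2$ times the correlation between $\mathbf{x}_1$ and $\mathbf{x}_2$ is equal to the
correlation coefficient times the weighted (by $\mathbf{w}_1$) variance of $\mathbf{x}_1$. Now this has
multiple solutions since we are not constraining the variance of the outputs to
be 1. If $\mathbf{w}_1$ and $\mathbf{w}_2$ are solutions to (9), then so are $a\mathbf{w}_1$ and $a\mathbf{w}_2$ for all real $a$.
Thus this method will find the correct correlation vectors but not with a unique
magnitude.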
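
The generalised eigenvalue formulation (9) and its scale ambiguity can be checked on a small example. The covariance values below are assumed purely for illustration (a 4-dimensional and a 3-dimensional vector whose first elements share a common source); because the right-hand matrix is invertible here, the problem is reduced to an ordinary eigenvalue problem.

```python
import numpy as np

# Assumed population covariances, chosen only for illustration.
S11 = np.diag([0.5, 1.0, 1.0, 1.0])
S22 = np.diag([0.5, 1.0, 1.0])
S12 = np.zeros((4, 3))
S12[0, 0] = 0.25

A = np.block([[np.zeros((4, 4)), S12], [S12.T, np.zeros((3, 3))]])
B = np.block([[S11, np.zeros((4, 3))], [np.zeros((3, 4)), S22]])

# B is invertible, so A w = rho B w is equivalent to (B^-1 A) w = rho w.
vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
top = np.argmax(vals.real)
rho = vals.real[top]
w = vecs[:, top].real
w1, w2 = w[:4], w[4:]
print(rho)

# Any rescaling of the solution still satisfies equation (9).
a = 3.7
residual = A @ (a * w) - rho * B @ (a * w)
```

The largest eigenvalue is the correlation coefficient, and the residual stays at zero for any scale factor, which is the non-uniqueness of magnitude noted in the text.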

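It has recently been shown [4] that solutions of the generalised eigenvalue
problem

$$A\mathbf{w} = \lambda B\mathbf{w} \qquad (10)$$

can be found using gradient ascent of the form

$$\frac{d\mathbf{w}}{dt} = A\mathbf{w} - f(\mathbf{w})B\mathbf{w} \qquad (11)$$

where the function $f(\mathbf{w}) : R^n - \{0\} \to R$ satisfies

1. $f(\mathbf{w})$ is locally Lipschitz continuous

2. $\exists M_1 > M_2 > 0 : f(\mathbf{w}) > \lambda_1, \forall \mathbf{w} : \|\mathbf{w}\| > M_1$ and $f(\mathbf{w}) < \lambda_n, \forall \mathbf{w} : 0 < \|\mathbf{w}\| < M_2$

3. $\forall \mathbf{w} \in R^n - \{0\}, \exists N_1 > N_2 > 0 : f(\theta\mathbf{w}) > \lambda_1, \forall \theta : \theta > N_1$ and $f(\theta\mathbf{w}) < \lambda_n, \forall \theta : 0 < \theta < N_2$,
and $f(\theta\mathbf{w})$ is a strictly monotonically increasing function
of $\theta$ in $[N_2, N_1]$.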
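Taking $\mathbf{w} = [\mathbf{w}_1^T \; \mathbf{w}_2^T]^T$, we find the canonical correlation directions $\mathbf{w}_1$ and $\mathbf{w}_2$
using

$$\frac{d\mathbf{w}_1}{dt} = \Sigma_{12}\mathbf{w}_2 - f(\mathbf{w}_1)\Sigma_{11}\mathbf{w}_1$$

$$\frac{d\mathbf{w}_2}{dt} = \Sigma_{21}\mathbf{w}_1 - f(\mathbf{w}_2)\Sigma_{22}\mathbf{w}_2$$

Using the facts that $\Sigma_{ij} = E(\mathbf{x}_i \mathbf{x}_j^T)$, $i, j = 1, 2$, and that $y_i = \mathbf{w}_i \cdot \mathbf{x}_i$, we may
propose the instantaneous rules

$$\Delta \mathbf{w}_1 = \mathbf{x}_1 y_2 - f(\mathbf{w}_1)\mathbf{x}_1 y_1$$

$$\Delta \mathbf{w}_2 = \mathbf{x}_2 y_1 - f(\mathbf{w}_2)\mathbf{x}_2 y_2$$

For example, if we choose $f(\mathbf{w}) = \ln(\mathbf{w}^T(t)\mathbf{w}(t))$, we have:

$$\Delta w_{1j} = \eta x_{1j}(y_2 - \ln(\mathbf{w}_1^T\mathbf{w}_1) y_1)$$

$$\Delta w_{2j} = \eta x_{2j}(y_1 - \ln(\mathbf{w}_2^T\mathbf{w}_2) y_2) \qquad (12)$$

This algorithm is simpler than that used previously [2], in that we do not need to
adjust the parameter $\lambda$ any more. In a similar manner, we can use: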



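$$\Delta w_j = \eta x_j \Big(y_2 - \ln\Big(\sum_j |w_j|\Big) y_1\Big) \qquad (13)$$

$$\Delta w_j = \eta x_j \Big(y_2 - \ln\Big(\max_{1 \le j \le n} |w_j|\Big) y_1\Big) \qquad (14)$$

$$\Delta w_j = \eta x_j \big(y_2 - (\mathbf{w}^T\mathbf{w} - \phi) y_1\big) \qquad (15)$$

$$\Delta w_j = \eta x_j \Big(y_2 - \Big(\max_{1 \le j \le n} |w_j| - \phi\Big) y_1\Big) \qquad (16)$$

$$\Delta w_j = \eta x_j \Big(y_2 - \Big(\sum_j |w_j| - \phi\Big) y_1\Big) \qquad (17)$$

The functions (12)-(17) will be known as $f_1()$, ..., $f_6()$ in the following. These new
algorithms only have a one-phase operation; there is no additional $\lambda$ parameter
to update.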
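
To make the family concrete, the six choices of $f$ can be written as plain functions of a weight vector and plugged into a single one-phase update. This is a minimal sketch: the names `F_FAMILY`, `PHI` and `cca_step` and the value chosen for $\phi$ are assumptions made for illustration.

```python
import numpy as np

PHI = 2.0  # hypothetical value of phi, chosen only for this example

# The six functions, in the order of rules (12)-(17).
F_FAMILY = {
    "f1": lambda w: np.log(w @ w),
    "f2": lambda w: np.log(np.sum(np.abs(w))),
    "f3": lambda w: np.log(np.max(np.abs(w))),
    "f4": lambda w: w @ w - PHI,
    "f5": lambda w: np.max(np.abs(w)) - PHI,
    "f6": lambda w: np.sum(np.abs(w)) - PHI,
}

def cca_step(w1, w2, x1, x2, f, eta):
    """One-phase update of both weight vectors for one pair of inputs."""
    y1, y2 = w1 @ x1, w2 @ x2
    new_w1 = w1 + eta * x1 * (y2 - f(w1) * y1)
    new_w2 = w2 + eta * x2 * (y1 - f(w2) * y2)
    return new_w1, new_w2

w1, w2 = cca_step(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                  np.array([1.0, 1.0]), np.array([2.0, 0.0]),
                  F_FAMILY["f1"], eta=0.1)
print(w1, w2)
```

Only the function `f` changes between members of the family; the update itself has no separate multiplier to adapt.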

It may be argued, however, that the rule (15) is equivalent to the update of
the $\lambda$ parameter in the previous section. This is only superficially true: firstly, the
derivations are quite different; secondly, the $\phi$ parameter in (15) must satisfy the
constraint that it is greater than the greatest eigenvalue of the covariance matrix
of the input data, whereas the rule of the previous section used the equivalent
parameter to ensure that the variances of the outputs were bounded. The need for
a larger value of $\phi$ in (15) has been verified experimentally. Nevertheless, we have
experimented with the functions $f_1()$, ..., $f_6()$ in the nonlinear case, with outputs
equivalent to

$$y_3 = \sum_j w_{3j} \tanh(v_{3j} x_{1j}) = \mathbf{w}_3 \mathbf{g}_3 \qquad (18)$$

$$y_4 = \sum_j w_{4j} \tanh(v_{4j} x_{2j}) = \mathbf{w}_4 \mathbf{g}_4 \qquad (19)$$

to get the nonlinear update equations used for $y_3$ and $y_4$:

$$\Delta \mathbf{w}_3 = \eta \mathbf{g}_3 (y_4 - f(\mathbf{w}) y_3)$$

$$\Delta v_{3i} = \eta x_{1i} w_{3i} (y_4 - f(\mathbf{w}) y_3)(1 - g_{3i}^2)$$

$$\Delta \mathbf{w}_4 = \eta \mathbf{g}_4 (y_3 - f(\mathbf{w}) y_4)$$

$$\Delta v_{4i} = \eta x_{2i} w_{4i} (y_3 - f(\mathbf{w}) y_4)(1 - g_{4i}^2) \qquad (20)$$

4 Simulations 

4.1 Artificial Data 

To compare this new family of algorithms with the existing neural algorithm,
we will use the artificial data reported in [2]. We generated an artificial data
set to give two vectors $\mathbf{x}_1$ and $\mathbf{x}_2$. $\mathbf{x}_1$ is a 4-dimensional vector, each of whose
elements is drawn from the zero-mean Gaussian distribution, N(0,1); $\mathbf{x}_2$ is a
3-dimensional vector, each of whose elements is also drawn from N(0,1). In order
to introduce correlations between the two vectors, $\mathbf{x}_1$ and $\mathbf{x}_2$, we generate an
additional sample T from N(0,1) and add it to the first elements of each vector
and then divide by 2 to ensure that there is no more variance in the first elements
than in the others. Thus there is no correlation between the two vectors other
than that existing between the first elements of each. All simulations have used
an initial learning rate of 0.001, which is decreased linearly to 0 over 100,000
iterations.

The weights’ convergence is shown in Figures 1 and 2. The convergence is
given as the angle between the weights at each iteration in our simulation and
the optimal set of weights, i.e. (1,0,0,0) for $\mathbf{x}_1$ and (1,0,0) for $\mathbf{x}_2$. In each
figure, we graph two lines to express the convergence of $\mathbf{w}_1$ and $\mathbf{w}_2$. Comparing
Figures 1 and 2, we can see that the new algorithm converges much faster than the
existing algorithm and is very stable. All of the learning algorithms of our class
have the same order of convergence speed, which means that the order of convergence
speed does not depend on the specific form of $f(\mathbf{w})$. The learning algorithms are
robust to implementation error on $f(\mathbf{w})$. The simple $f(\mathbf{w})$ reduces the
implementation complexity.
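
The experiment described above can be approximated with the following sketch, which applies rule (12) to data generated as described and reports the angle between each weight vector and its optimal direction. Several details are assumptions made for this example rather than the authors' exact setup: the weight initialisation, the random seed, and a shorter run of 50,000 iterations to keep the example quick.

```python
import numpy as np

def make_data(n, rng):
    """Artificial data: only the first elements of x1 and x2 are correlated."""
    X1 = rng.normal(size=(n, 4))
    X2 = rng.normal(size=(n, 3))
    shared = rng.normal(size=n)
    X1[:, 0] = (X1[:, 0] + shared) / 2
    X2[:, 0] = (X2[:, 0] + shared) / 2
    return X1, X2

def angle_to(w, target):
    """Angle in radians between w and target, ignoring sign."""
    cos = abs(w @ target) / (np.linalg.norm(w) * np.linalg.norm(target))
    return float(np.arccos(min(1.0, cos)))

def train(n_iter=50_000, eta0=0.001, seed=0):
    rng = np.random.default_rng(seed)
    X1, X2 = make_data(n_iter, rng)
    w1 = rng.uniform(0.1, 0.5, size=4)   # assumed initialisation
    w2 = rng.uniform(0.1, 0.5, size=3)
    for t in range(n_iter):
        eta = eta0 * (1 - t / n_iter)    # learning rate decays linearly to 0
        x1, x2 = X1[t], X2[t]
        y1, y2 = w1 @ x1, w2 @ x2
        d1 = eta * x1 * (y2 - np.log(w1 @ w1) * y1)   # rule (12) for w1
        d2 = eta * x2 * (y1 - np.log(w2 @ w2) * y2)   # rule (12) for w2
        w1, w2 = w1 + d1, w2 + d2
    return w1, w2

w1, w2 = train()
angle1 = angle_to(w1, np.array([1.0, 0.0, 0.0, 0.0]))
angle2 = angle_to(w2, np.array([1.0, 0.0, 0.0]))
print(angle1, angle2)
```

Small final angles indicate that both weight vectors have aligned with the first elements of their inputs, which are the only correlated components in this data set.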

4.2 Real Data 

Again, in order to compare our family of methods with those reported earlier,
we use data taken from [3]. The data set consists of 88 students who sat 5 exams,
2 of which were closed-book exams while the other 3 were open-book exams.
Thus we have a two-dimensional $\mathbf{x}_1$ and a three-dimensional $\mathbf{x}_2$. To illustrate
the non-uniqueness of the correlation vectors and the effect of the parameter, we
show in Table 1 a set of results. In our experiment, the learning rate was 0.0001
and the number of iterations was 50000.

In Table 1, the $\mathbf{w}_1$ vector consists of the weights from the closed-book exam data
to $y_1$ while the $\mathbf{w}_2$ vector consists of the weights from the open-book exam data
to $y_2$. We note the excellent agreement between the methods.

Method                    Maximum correlation   w1                w2
Standard statistics       0.6630                0.0260  0.0518    0.0824  0.00081  0.0035
Existing neural network   0.6962                0.0260  0.0518    0.0824  0.0081   0.0035
New neural network        0.6790                0.0270            0.0810  0.0090   0.0040

Table 1. Correlation and Weight Value of the Real Data Experiment



4.3 Random Dot Stereograms

Becker (1996) has developed the idea that one of the goals of sensory
information processing may be the extraction of common information between
different sensors or across sensory modalities. Becker tested this idea on
a data set which is an abstraction of random dot stereograms, as illustrated in
the accompanying figure. The two different neural units should learn to extract
features that are coherent across their inputs. If there is any feature in common
across the two inputs, it should be discovered, while features that are independent
across the two inputs will be ignored. We wish to find the maximum linear
correlation between $y_1$ and $y_2$, which are themselves linear combinations of
$\mathbf{x}_1$ and $\mathbf{x}_2$. In order to find these, we require two pairs of outputs and the
corresponding pairs of weights ($\mathbf{w}_1$, $\mathbf{w}_2$) and ($\mathbf{w}_3$, $\mathbf{w}_4$). The learning rules
for $\mathbf{w}_3$ and $\mathbf{w}_4$ in this experiment are analogous to those for $\mathbf{w}_1$ and $\mathbf{w}_2$;
at each presentation of a sample of input data, a simple competition between the
products $y_1 y_2$ and $y_3 y_4$ determines which weights will learn on the current
input samples: if $y_1 y_2 > y_3 y_4$, $\mathbf{w}_1$, $\mathbf{w}_2$ are updated, else $\mathbf{w}_3$,
$\mathbf{w}_4$ are updated.

Using a learning rate of 0.001 and 100000 iterations with $f_1()$, the weights
converge to the vectors shown in Table 2. The first pair of weights, $\mathbf{w}_1$ and $\mathbf{w}_2$,
have identified the second element of $\mathbf{x}_1$ and the third element of $\mathbf{x}_2$ as having
maximum correlation while other inputs are ignored (the weights from these are
approximately 0). This corresponds to a right shift. This first pair of outputs has
a (sample) correlation of 0.542. Similarly, the second pair of weights has identified
the third element of $\mathbf{x}_1$ and the second element of $\mathbf{x}_2$ as having maximum
correlation while other inputs are ignored. The second pair has a (sample) correlation
of 0.488 and corresponds to an identification of left shift.

w1   -0.004    1.649   -0.004    0.000
w2    0.000   -0.004    1.649   -0.003
w3    0.000    0.010    1.649    0.016
w4    0.016    1.649    0.018    0.000

Table 2. The converged weights of Random dot stereograms



5 Nonlinear correlations 

We investigate the general problem of maximising correlations between two data
sets when there may be an underlying nonlinear relationship between the data
sets: to compare with previous experiments [2], we generate artificial data
according to the prescription:

$$x_{11} = \sin\theta + \mu_1 \qquad (21)$$

$$x_{12} = \cos\theta + \mu_2 \qquad (22)$$

$$x_{21} = \theta - \pi + \mu_3 \qquad (23)$$

$$x_{22} = \theta - \pi + \mu_4 \qquad (24)$$

where $\theta$ is drawn from a uniform distribution in $[0, 2\pi]$ and $\mu_i$, $i = 1, \ldots, 4$, are drawn
from the zero-mean Gaussian distribution N(0,0.1). Equations (21) and (22) define
a circular manifold in the two-dimensional input space while Equations (23) and
(24) define a linear manifold within the input space, where each manifold is only
approximate due to the presence of noise ($\mu_i$, $i = 1, \ldots, 4$). The subtraction of $\pi$ in
the linear manifold equations is merely to centre the data. Thus $\mathbf{x}_1 = (x_{11}, x_{12})$
lies on or near the circular manifold $x_{11}^2 + x_{12}^2 = 1$ while $\mathbf{x}_2 = (x_{21}, x_{22})$ lies on
or near the line $x_{21} = x_{22}$. We wish to test whether the new algorithm can find
correlations between the two data sets and to test whether such correlations
are greater than the maximum linear correlations. To do this we train two pairs
of output neurons:

• we train one pair of weights $\mathbf{w}_1$ and $\mathbf{w}_2$ using rules (12);

• we train a second pair of outputs, $y_3$ and $y_4$, which are calculated using (18) and (19).

The correlation between the $y_1$ and $y_2$ neurons was maximised using the previous
linear operation (12) while that for $y_3$ and $y_4$ was maximised using (20).
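
The data prescription of equations (21)-(24) can be sketched as below. Treating the second argument of N(0,0.1) as a variance is an assumption of this example, as are the function name `make_nonlinear_data` and the sample size.

```python
import numpy as np

def make_nonlinear_data(n, noise_var=0.1, seed=0):
    """Generate n samples of x1 (noisy circle) and x2 (noisy line)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    # Assumption: 0.1 is taken as the noise variance, not the standard deviation.
    mu = rng.normal(scale=np.sqrt(noise_var), size=(n, 4))
    X1 = np.column_stack([np.sin(theta) + mu[:, 0],      # equation (21)
                          np.cos(theta) + mu[:, 1]])     # equation (22)
    X2 = np.column_stack([theta - np.pi + mu[:, 2],      # equation (23)
                          theta - np.pi + mu[:, 3]])     # equation (24)
    return X1, X2

X1, X2 = make_nonlinear_data(20_000)
radius_sq = np.sum(X1 ** 2, axis=1)
line_corr = np.corrcoef(X2[:, 0], X2[:, 1])[0, 1]
print(radius_sq.mean(), line_corr)
```

The mean squared radius of the first data set sits slightly above 1 because of the added noise, and the two coordinates of the second data set are strongly correlated, as expected for points near the line $x_{21} = x_{22}$.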

We use a learning rate of 0.001 for all weights and learn over 100 000
iterations. We did not attempt to optimize any parameters for the existing or the
new algorithm. In the linear case, the network finds a linear correlation between
the data sets similar to that found by the Lai and Fyfe method. However, in the
nonlinear case our family of networks finds a greater correlation.





            existing   f1()    f2()    f3()    f4()    f5()    f6()
linear      0.452      0.444   0.452   0.452   0.452   0.452   0.452
nonlinear   0.702      0.859   0.859   0.859   0.863   0.831   0.814

Table 3. Linear and nonlinear correlations

In Table 3, we can see that the new algorithms find essentially the same correlation
as the existing algorithm in the linear case and can find greater nonlinear
correlations than the existing algorithm.

6 Conclusion 

We have derived a neural implementation of Canonical Correlation Analysis and 
shown that it is much more effective than previous neural implementations of 
CCA. In particular, it 

— is much faster than previous algorithms 

— is much less dependent on selection of optimal parameter values than previ- 
ous methods. 

The new methods constitute a very robust family of neural implementations of 
CCA. 

We have also extended the method to nonlinear correlations and shown that 
this family of methods robustly finds greater correlations than is possible with 
linear methods. 
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Abstract. In this paper, a novel approach to knowledge discovery is 
proposed based on the integration of kernel principal component analysis 
(KPCA) with an improved evolutionary algorithm. KPCA is utilized to first 
transform the original sample space to a nonlinear feature space via the 
appropriate kernel function, and then perform principal component analysis 
(PCA). However, selecting the optimal kernel function remains an open problem.
This paper addresses it with an improved evolutionary algorithm that
incorporates Gauss mutation. The application in fault diagnosis shows
that the integration of KPCA with evolutionary computation is effective and
efficient in discovering the optimal nonlinear feature transformation
corresponding to the real-world operational data.



1 Introduction 

In the last decade, knowledge discovery, or data mining, has attracted more and more
attention in the field of automatic information processing. Although it is still in its
infancy, knowledge discovery has been successfully applied in many areas, such as
market data analysis [1], process monitoring and diagnosis [2], and financial
engineering. In these areas, mountains of data are collected every day. How to
automatically analyze such volumes of data and then make appropriate decisions
remains a hard problem. Knowledge discovery highlights the techniques that can
extract knowledge from data. Many technologies, such as neural networks
[3], rough sets [4] and genetic algorithms [5], [6], are applied individually or
in combination in knowledge discovery.

Principal component analysis (PCA) has played an important role
in dimensionality reduction, noise removal, and feature extraction of the original
data sets as a pre-processing step in knowledge discovery. It is easy to implement
and effective in most common cases. However, for some complicated cases in
industrial processes, especially nonlinear ones, PCA performs poorly because of
its linear nature. A modified PCA technique, called kernel principal component

K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 174-179, 2000. 
analysis (KPCA) [7], [8], has emerged in recent years to tackle the nonlinear
problem. KPCA can efficiently compute principal components in high-dimensional
feature spaces by the use of integral operators and nonlinear kernel
functions. For a given data set, the key step is to select a corresponding optimal
kernel function in order to obtain the optimal nonlinear feature transformation. This
paper addresses this unsolved problem using evolutionary computation.

Evolutionary computation, mainly composed of four branches, namely genetic
algorithms (GA), genetic programming (GP), evolution strategies (ES), and
evolutionary programming (EP), has developed into a powerful intelligent
optimization tool employed extensively in complex real-world problems, such as
face recognition, structure design, and job shop scheduling. The paper proposes an
improved evolutionary algorithm, which incorporates a Gauss mutation operator.
It is then used to solve the optimal selection of kernel functions when we
perform KPCA.

The organization of the paper is as follows. The mathematical basis and the
relationship between PCA and KPCA are reviewed in Section 2. An improved
evolutionary algorithm is detailed in Section 3. Then, the integration of KPCA and
the improved evolutionary algorithm for feature extraction in fault diagnosis is given
in Section 4. Finally, the conclusions are summarized in Section 5.



2 PCA and KPCA 



2.1 Principal Component Analysis 



PCA is an orthogonal transformation technique of the initial coordinate system in 
which we describe our data. The transformed new coordinate system can describe 
the data using a small number of principal components (PCs) while retaining as 
much of the variation as possible. Thus, it can extract effective features from 
original variables, especially for redundant, correlated variables. 

Given a known data matrix D, representing M observations of N variables as

$$D = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1N} \\ x_{21} & x_{22} & \cdots & x_{2N} \\ \vdots & \vdots & & \vdots \\ x_{M1} & x_{M2} & \cdots & x_{MN} \end{bmatrix} \qquad (1)$$

Denote one observation as a column vector $\mathbf{x}_k = [x_{k1}, x_{k2}, \ldots, x_{kN}]^T$, $k = 1, 2, \ldots, M$,
where $\mathbf{x}_k \in R^N$ and $\sum_k \mathbf{x}_k = 0$, i.e. the mean value of each variable is set to zero. The
superscript T denotes the transpose operation of the vector. Thus, PCA, in essence, is
to diagonalize the covariance matrix

$$C = \frac{1}{M} \sum_{j=1}^{M} \mathbf{x}_j \mathbf{x}_j^T \qquad (2)$$

Then solve the eigenvalue equation

$$\lambda \mathbf{v} = C\mathbf{v}, \qquad (3)$$

where $\lambda$ is the eigenvalue and $\mathbf{v}$ is the corresponding eigenvector.

We can obtain the solutions to Eq. (3) by the methods of linear algebra. Then we
obtain the PCs, which are linear combinations of the original variables.
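
A minimal numerical sketch of Eqs. (2) and (3) follows. The function name `pca` and the synthetic data matrix are assumptions made for this example; the eigenvalue equation is solved with a standard symmetric eigensolver.

```python
import numpy as np

def pca(D, n_components):
    """PCA of a data matrix D with one observation per row (M x N)."""
    X = D - D.mean(axis=0)                    # set the mean of each variable to zero
    C = X.T @ X / X.shape[0]                  # covariance matrix, Eq. (2)
    eigvals, eigvecs = np.linalg.eigh(C)      # solves Eq. (3) for symmetric C
    order = np.argsort(eigvals)[::-1]         # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = X @ eigvecs[:, :n_components]    # principal components of each observation
    return eigvals[:n_components], eigvecs[:, :n_components], scores

# Synthetic data whose first variable carries most of the variance.
rng = np.random.default_rng(1)
D = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])
eigvals, eigvecs, scores = pca(D, 2)
print(eigvals)
```

Each eigenvalue equals the variance of the data projected onto the corresponding eigenvector, which is why keeping the largest eigenvalues retains as much variation as possible.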



2.2 Kernel Principal Component Analysis 

PCA performs well when dealing with linear problems. However, for
complex nonlinear problems, PCA cannot perform as well. Kernel
principal component analysis (KPCA) is an emerging technique to address
nonlinear problems on the basis of PCA. KPCA is a superset of PCA, and it involves
two major operations: first mapping the original data space to a feature space via the
pre-selected kernel function, and then performing PCA in that feature space.

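For the given data matrix D in Eq. (1), the covariance matrix C in KPCA is now
altered to

$$C = \frac{1}{M} \sum_{j=1}^{M} \Phi(\mathbf{x}_j)\Phi(\mathbf{x}_j)^T, \qquad (4)$$

where $\Phi$ is the mapping function from the N-dimensional data space to a
certain feature space F.

So, the eigenvalue equation is expressed as:

$$\lambda V = CV. \qquad (5)$$

If we perform the mapping with an explicit expression of $\Phi : R^N \to F$, then in
general, we will face the ‘curse of dimensionality’. It can be avoided by
computing dot products in feature space via a kernel function [8]. From Eq. (4) and
(5), we can obtain:

$$\lambda(\Phi(\mathbf{x}_k) \cdot V) = (\Phi(\mathbf{x}_k) \cdot CV), \quad k = 1, 2, \ldots, M. \qquad (6)$$

The eigenvector V can be expressed as:

$$V = \sum_{i=1}^{M} \alpha_i \Phi(\mathbf{x}_i), \qquad (7)$$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_M)^T$ is the coefficient column vector.

Combining Eq. (4), (6), and (7), we obtain:

$$\lambda \sum_{i=1}^{M} \alpha_i (\Phi(\mathbf{x}_k) \cdot \Phi(\mathbf{x}_i)) = \frac{1}{M} \sum_{i=1}^{M} \alpha_i \Big(\Phi(\mathbf{x}_k) \cdot \sum_{j=1}^{M} \Phi(\mathbf{x}_j)\Big)(\Phi(\mathbf{x}_j) \cdot \Phi(\mathbf{x}_i)), \quad k = 1, 2, \ldots, M. \qquad (8)$$

If we define a square matrix K with M rows and M columns as

$$K_{ij} = (\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)), \qquad (9)$$

then Eq. (8) can be simplified as

$$M\lambda K\alpha = K^2\alpha. \qquad (10)$$

In order to obtain the solutions to Eq. (10), we can solve the eigenvalue problem

$$M\lambda\alpha = K\alpha. \qquad (11)$$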
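
The eigenvalue problem (11) can be sketched numerically as follows. This is an illustrative sketch under stated assumptions: a Gaussian kernel is used as one possible kernel choice, the kernel matrix is centred in feature space (a step not shown in the text), and the names `gaussian_kernel_matrix` and `kernel_pca` are introduced only for this example.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)); an assumed kernel choice."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_pca(X, sigma, n_components):
    M = X.shape[0]
    K = gaussian_kernel_matrix(X, sigma)
    # Centre the mapped data in feature space (assumption: the derivation
    # treats the mapped data as zero-mean without showing this step).
    one = np.full((M, M), 1.0 / M)
    Kc = K - one @ K - K @ one + one @ K @ one
    mu, alpha = np.linalg.eigh(Kc)             # mu plays the role of M * lambda in Eq. (11)
    order = np.argsort(mu)[::-1][:n_components]
    mu, alpha = mu[order], alpha[:, order]
    alpha = alpha / np.sqrt(mu)                # scale so that V . V = 1 in feature space
    return mu / M, Kc @ alpha                  # eigenvalues lambda and projections

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
lam, proj = kernel_pca(X, sigma=2.0, n_components=2)
print(lam)
```

With this normalisation, the mean squared projection of the training points onto each component equals the corresponding eigenvalue $\lambda$, mirroring the variance interpretation of ordinary PCA.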
3 An Improved Evolutionary Algorithm

The most attractive advantage of evolutionary algorithms, such as genetic
algorithms, is their global search ability. However, because of their stochastic
operations, they cannot guarantee that the best individual in the population of
each generation converges to the global optimum. Indeed, the problem of premature
convergence often occurs in evolutionary algorithms. In [9], Miller et al. proposed
three types of local search operators to alleviate premature convergence.
However, they are only suitable for binary-coded genetic algorithms. Ackley [10]
integrated hill climbing with genetic algorithms, which can improve the local search
ability. However, the integration requires computing the gradient of the objective
function, thus losing the main advantage of evolutionary algorithms.

In addition to the commonly used genetic operators designed by Michalewicz
[11], this paper presents a novel mutation operator, Gauss mutation. It is defined as
follows:

$$\sigma_j = k \cdot \min\{x_j^{t-1} - a_j, \; b_j - x_j^{t-1}\} \Big(1 - \frac{t}{M_g}\Big)^s, \quad (j = 1, 2, \ldots, M), \qquad (12)$$

where $k$ is a constant within the closed interval [0,1]; $t$ is the generation; $x_j^{t-1}$ is the
j-th variable to be optimized in the (t-1)th generation; $[a_j, b_j]$ is the j-th variable's
range; $M_g$ is the maximum generation; $s$ is a shape parameter; and $M$ is the number
of variables.

The mutation of the j-th variable $x_j$ is expressed as

$$x_j' = x_j + \varepsilon_j, \quad (j = 1, 2, \ldots, M) \qquad (13)$$

$$\varepsilon_j \sim N(0, \sigma_j) \qquad (14)$$

where $\varepsilon_j$ is a Gaussian random variable with zero mean and standard
deviation $\sigma_j$.

The improved evolutionary algorithm, which incorporates the Gauss mutation
operator, can accelerate the convergence rate. This will be shown in the next section.
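
The Gauss mutation operator of Eqs. (12)-(14) can be sketched as follows. The time-decay factor $(1 - t/M_g)^s$ is an assumed reading of Eq. (12), and the function name `gauss_mutation` and the default values of `k` and `s` are chosen only for illustration. The sketch does not clip mutated values to the variable range.

```python
import numpy as np

def gauss_mutation(x, lower, upper, t, max_gen, k=0.5, s=2.0, rng=None):
    """Mutate each variable with Gaussian noise whose spread shrinks over time.

    sigma_j = k * min(x_j - a_j, b_j - x_j) * (1 - t / max_gen) ** s
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    room = np.minimum(x - lower, upper - x)       # distance to the nearer bound
    sigma = k * room * (1.0 - t / max_gen) ** s   # Eq. (12), assumed decay form
    child = x + rng.normal(size=x.shape) * sigma  # Eqs. (13) and (14)
    return child, sigma

x = np.array([0.2, 0.5, 0.9])
lower, upper = np.zeros(3), np.ones(3)
child, sigma = gauss_mutation(x, lower, upper, t=10, max_gen=100,
                              rng=np.random.default_rng(3))
print(sigma)
```

Variables close to a bound receive a small standard deviation, and all standard deviations shrink to zero as the generation counter approaches the maximum generation, so the search becomes increasingly local.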



4 Application in Fault Diagnosis 

There are many types of functions used as kernel functions in KPCA, such as the Gauss
radial basis function:

$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\Big(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\Big), \qquad (15)$$

where $\sigma$ is the standard deviation, which must be predetermined. We use the
improved evolutionary algorithm described in Section 3 to obtain the optimal
parameter $\sigma$. A fitness criterion is therefore first required for evaluating the
individuals.
In the field of fault diagnosis for large-scale rotating machinery, the
classification of commonly encountered faults is the core problem. Its complexity
varies with the real-world situation. In general, we can measure the class
separability by the trace of a scatter matrix. Therefore, we design the fitness
function as:

$$f_i = \frac{tr(S_b)}{tr(S_w)} \qquad (16)$$

where $S_b$ and $S_w$ are the between-class and within-class scatter matrices,
respectively. The bigger the $f_i$, the better the class separability.
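
The fitness criterion of Eq. (16) can be sketched as follows. The scatter matrices are taken in their usual form (sums of outer products around the class means and the overall mean), which is an assumption since the text does not spell them out; the function name `separability_fitness` and the toy data are also illustrative.

```python
import numpy as np

def separability_fitness(features, labels):
    """tr(S_b) / tr(S_w) for feature rows grouped by class label."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    overall_mean = features.mean(axis=0)
    tr_b = 0.0
    tr_w = 0.0
    for c in np.unique(labels):
        group = features[labels == c]
        mean_c = group.mean(axis=0)
        # The trace of a sum of outer products equals the sum of squared distances.
        tr_b += len(group) * np.sum((mean_c - overall_mean) ** 2)
        tr_w += np.sum((group - mean_c) ** 2)
    return tr_b / tr_w

labels = np.array([0, 0, 1, 1])
far_apart = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
close = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
fit_far = separability_fitness(far_apart, labels)
fit_close = separability_fitness(close, labels)
print(fit_far, fit_close)
```

Moving the class means further apart while keeping the within-class spread fixed raises the fitness, which is the behaviour an evolutionary search over the kernel parameter would reward.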

Here, we use three typical faults, misalignment, unbalance and rub, as an
example. The original data samples are 8 holospectrum parameters for 15
observations [12]. Figures 1-3 show the classification results, with the first and
the second principal components as the horizontal and vertical axes, respectively.
Figure 4 shows the fitness curves of the Michalewicz genetic algorithm (Mich-GA) and
the improved evolutionary algorithm (Imp-GA).




Fig. 1. Initial features (horizontal axis: initial feature 1)

Fig. 2. The first two principal components obtained by PCA (horizontal axis: PCA PC-1)

Fig. 3. The first two principal components obtained by KPCA (horizontal axis: KPCA PC-1)

Fig. 4. Fitness curves (horizontal axis: generation)

In Figures 1-3, different symbols (one of which is “o”) stand for the sample points for
misalignment, unbalance, and rub.
From Figure 2 and Figure 3, we find that KPCA, when integrated with the improved
evolutionary algorithm, improves the class separability significantly compared with
PCA. Figure 4 shows that Imp-GA has a more rapid convergence rate than Mich-GA.



5 Conclusions 

This paper mainly concentrates on the automatic discovery of the optimal nonlinear
feature transformation by KPCA, which is an improved version of PCA, by
employing an improved evolutionary algorithm, which incorporates the Gauss
mutation operator. The application example in fault diagnosis shows that the class
separability is improved significantly by integrating KPCA with the improved
evolutionary algorithm.
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Abstract. This study proposes a data mining framework to discover 
qualitative and quantitative patterns in discrete-valued time series (DTS).
In our method, there are three levels for mining temporal patterns. At the
first level, a structural method based on distance measures through
polynomial modelling is employed to find pattern structures; the second level
performs a value-based search using local polynomial analysis; and
the third level, based on multilevel-local polynomial models (MLPMs),
finds global patterns from a DTS set. We demonstrate our method on
the analysis of “Exchange Rates Patterns” between the U.S. dollar and
the Australian dollar.



1 Introduction 

Discovering both qualitative and quantitative temporal patterns in temporal 
databases is a challenging task for research in the area of temporal data mining. 
Although there are various results to date on discovering periodic patterns and 
similarity patterns in discrete-valued time series (DTS) datasets (e.g. [5,2,3]), 
a general theory and method of data analysis of discovering patterns for DTS 
data analysis is not well known. 

The framework we introduce here is based on a new model of DTS, where the
qualitative aspects of the time series are analysed separately from the
quantitative aspects. This new approach also allows us to find important
characteristics in the DTS relevant to the discovery of temporal patterns. The
first step of the framework involves a distance measure function for discovering
structural patterns (shapes). In this step, only the rough shapes of patterns are
determined from the DTS, and a distance measure is employed to compute the
nearest neighbors (NN) to, or the closest candidates of, given patterns among the
similar ones selected. In the second step, the degree of similarity and
periodicity between the extracted patterns is measured based on local polynomial
models. The third step of the framework consists of a multilevel-local polynomial
model analysis for discovering all temporal patterns based on the results of the
first two steps, namely the similarity and periodicity at the structure level and
the pure value level. We also demonstrate our method on a real-world DTS.

K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 180-186, 2000. 
The rest of the paper is organised as follows. Section 2 presents the
definitions, basic methods and our new method of multilevel-local polynomial
models (MLPMs). Section 3 applies the new models to “Daily Foreign Exchange
Rates” data. The last section concludes the paper with a short summary.

2 Definitions and Basic Methods 

We first define what we mean by a DTS and introduce some notation used later.
Then the basic models and our new method, which we call the multilevel-local
polynomial model, are given and studied in detail in the rest of the paper.

2.1 Definitions and Properties 

Definition 1 Suppose that $\{\Omega, \Gamma, \Sigma\}$ is a probability space,
and $T$ is a discrete-valued time index set. If for any $t \in T$ there exists a
random variable $\xi_t(\omega)$ defined on $\{\Omega, \Gamma, \Sigma\}$, then the
family of random variables $\{\xi_t(\omega), t \in T\}$ is called a
discrete-valued time series (DTS).

The random variables $\xi_t(\omega)$, $t \in T$, in the above definition should
be understood as complex-valued variables in general. In the sequel, we use the
succinct form $\{\xi_t, t \in T\}$ of the stochastic process, in which the
element $\omega$ is omitted.

In a DTS, we assume that for every successive pair of time points,
$t_{j+1} - t_j = f(t)$ is a function of time. For every three successive time
points $X_j$, $X_{j+1}$ and $X_{j+2}$, the triple value
$(Y_j, Y_{j+1}, Y_{j+2})$ has only nine distinct states (also called nine local
features). Let $S_s$ be the same state as the prior one, $S_u$ the go-up state
compared with the prior one, and $S_d$ the go-down state compared with the prior
one. Then we have the state space

$S = \{s1, s2, s3, s4, s5, s6, s7, s8, s9\} = \{(Y_j, S_u, S_u), (Y_j, S_u, S_s), (Y_j, S_u, S_d), (Y_j, S_s, S_u), (Y_j, S_s, S_s), (Y_j, S_s, S_d), (Y_j, S_d, S_u), (Y_j, S_d, S_s), (Y_j, S_d, S_d)\}$.
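
The mapping from a value sequence to its nine local states can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the numbering of the states (s1 = up/up, s2 = up/same, ..., s9 = down/down) is an assumption read off the order in which the state space is listed.

```python
# Illustrative sketch: map a numeric series to the nine local states s1..s9.
# Assumption: states are numbered in the order (up,up), (up,same), (up,down),
# (same,up), ..., (down,down), following the listing of the state space S.

def move(prev, curr):
    """Classify one step as 'u' (go-up), 's' (same) or 'd' (go-down)."""
    if curr > prev:
        return "u"
    if curr < prev:
        return "d"
    return "s"

MOVES = ("u", "s", "d")
STATE_INDEX = {(a, b): 3 * i + j + 1
               for i, a in enumerate(MOVES)
               for j, b in enumerate(MOVES)}

def structural_base(values):
    """Return the state numbers (1..9) for every three successive points."""
    return [STATE_INDEX[(move(values[k], values[k + 1]),
                         move(values[k + 1], values[k + 2]))]
            for k in range(len(values) - 2)]

print(structural_base([1.0, 1.2, 1.2, 1.1, 1.3]))  # [2, 6, 7]
```

A constant run such as `[1, 1, 1]` maps to state 5, the "same, same" state.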

Definition 2 Let $h = \{h_1, h_2, \ldots\}$ be a sequence. If for every
$h_j \in h$, $h_j \in S$, then the sequence $h$ is called a Structural Base
sequence. Let $y = \{y_1, y_2, \ldots\}$ be a real-valued sequence; then $y$ is
called a value-point process.

A sequence is called a full periodic sequence if its every point in time con- 
tributes (precisely or approximately) to the cyclic behavior of the overall time 
series (that is, there are cyclic patterns with the same or different periods of 
repetition). A sequence is called a partial periodic sequence if the behavior of the 
sequence is periodic at some but not all points in the time series. 

We have the following results (the proofs are straightforward from the above
definitions):

Lemma 1. Let $h = \{h_1, h_2, \ldots\}$ be a sequence with every $h_j \in h$,
$h_j \in S$. If $h$ is a periodic sequence, then $h$ is a structural periodic
sequence (that is, periodic pattern(s) exist).

Lemma 2. A discrete-valued data set contains periodic patterns if and only if
there exist structural periodic patterns and periodic value-point processes with
or without an independent and identical distribution (i.i.d.).

Lemma 3. A discrete-valued data set contains similarity patterns if and only if
there exist structural periodic patterns and similarity value-point processes
with or without an independent and identical distribution (i.i.d.).



2.2 Local Polynomial Models (LPMs) 

The key idea of local modelling is explained in the context of least squares
regression models. We use standard results from local polynomial analysis theory,
which can be found in the literature on linear polynomial analysis (e.g., [4]).
In our data mining method, we only consider the structural relationship between
the response variable $Y$ and the vector of covariates
$X = (t, X_1, \ldots, X_n)^T$. For a given dataset, we can regard the data as
being generated from the model

$$Y = m(X) + \sigma(X)\varepsilon,$$

where $E(\varepsilon) = 0$, $Var(\varepsilon) = 1$, and $X$ and $\varepsilon$ are
independent. We approximate the unknown regression function $m(x)$ locally by a
polynomial of order $p$ in a neighbourhood of $x_0$,

$$m(x) \approx \sum_{j=0}^{p} \frac{m^{(j)}(x_0)}{j!}(x - x_0)^j \equiv \sum_{j=0}^{p} \beta_j (x - x_0)^j.$$

This polynomial is fitted locally by a weighted least squares regression problem:

$$\text{minimize} \sum_{i=1}^{n} \Big\{ Y_i - \sum_{j=0}^{p} \beta_j (X_i - x_0)^j \Big\}^2 K_\delta(X_i - x_0),$$

where $\delta$ is often called the bandwidth, and $K_\delta(\cdot) = K(\cdot/\delta)/\delta$
with $K$ a kernel function assigning weights to each datum point.
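
As a rough illustration of the weighted least squares problem, the following Python sketch fits a local polynomial around a point `x0` with an Epanechnikov kernel. It covers the one-dimensional case only; the function names and the use of NumPy are assumptions made for this example, not part of the original prototype.

```python
import numpy as np

def epanechnikov(z):
    """Epanechnikov kernel K(z) = 0.75 * (1 - z^2) for |z| <= 1, else 0."""
    return np.where(np.abs(z) <= 1.0, 0.75 * (1.0 - z ** 2), 0.0)

def local_poly_fit(x, y, x0, bandwidth, order=1):
    """Weighted least squares fit of a degree-`order` polynomial around x0.

    Returns the coefficient vector beta; beta[0] is the estimate of m(x0).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Kernel weights K_delta(X_i - x0) = K((X_i - x0) / delta) / delta
    w = epanechnikov((x - x0) / bandwidth) / bandwidth
    # Design matrix with columns (X_i - x0)^j for j = 0..order
    design = np.vander(x - x0, N=order + 1, increasing=True)
    # Scaling rows by sqrt(w) turns the weighted problem into ordinary least squares
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return beta

x = np.linspace(0.0, 1.0, 21)
y = 2.0 + 3.0 * x
print(local_poly_fit(x, y, x0=0.5, bandwidth=0.3))
```

For exactly linear data a local linear fit recovers the line, so `beta[0]` is the value at `x0` and `beta[1]` the slope. The bandwidth controls how many neighbouring points receive non-zero weight.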



2.3 Multilevel-Local Polynomial Models(MLPMs) 

For building up our new data mining model, we divide the data sequence or, 
data vector sequence into two groups: (1) the structural-base data group and, 
(2) the pure value-based data group. 

Footnote (Section 2.2): We always denote the conditional variance of $Y$ given
$X = x_0$ by $\sigma^2(x_0)$ and the density of $X$ by $f(\cdot)$.

Footnote (Section 2.2): In section 3, we choose the Epanechnikov kernel function
$K(z) = \frac{3}{4}(1 - z^2)_+$, which minimises the asymptotic MSE of the
resulting local polynomial estimators, for our experiments in pure-value pattern
searching, and we use the distance function in structure pattern searching.
In group one, we only consider the data sequence as a 9-state structural
sequence and use squared distance functions, which are provided by a class of
positive semidefinite quadratic forms. For example, let
$u = (u_1, u_2, \ldots, u_p)$ denote the p-dimensional observation of each
different distance of patterns in a state on an object that is to be assigned to
one of the $g$ prespecified groups. Then, for measuring the squared distance
between $u$ and the centroid of the $i$th group, we can consider the function [1]
$D^2(i) = (u - y)' M (u - y)$, where $y$ is the centroid and $M$ is a positive
semidefinite matrix, which ensures that $D^2(i) \ge 0$.
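
A minimal sketch of this squared distance, assuming NumPy and hypothetical function names, shows how an observation could be assigned to the group with the nearest centroid. The choice of the matrix `M` (here the identity in the usage example) is left open by the text.

```python
import numpy as np

def squared_distance(u, centroid, M):
    """Quadratic-form distance D^2 = (u - y)' M (u - y)."""
    diff = np.asarray(u, dtype=float) - np.asarray(centroid, dtype=float)
    return float(diff @ np.asarray(M, dtype=float) @ diff)

def nearest_group(u, centroids, M):
    """Return the index of the closest centroid and all squared distances."""
    d2 = [squared_distance(u, c, M) for c in centroids]
    return int(np.argmin(d2)), d2

group, distances = nearest_group([0.0, 1.0], [[0.0, 0.0], [3.0, 4.0]], np.eye(2))
print(group, distances)  # 0 [1.0, 18.0]
```

With a positive semidefinite `M` every value returned by `squared_distance` is non-negative, which is the property the quadratic form is chosen for.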

For value-point pattern discovery, we use local polynomial techniques on the
given bivariate data $(X_1, Y_1), \ldots, (X_n, Y_n)$. We can replace the weighted
least squares regression function in section 2.2 by

$$\sum_{i=1}^{n} \ell\Big\{ Y_i - \sum_{j=0}^{p} \beta_j (X_i - x_0)^j \Big\} K_\delta(X_i - x_0),$$

where $\ell(\cdot)$ is a loss function. For the purpose of predicting future
values, we use a special case of the above function with
$\ell_\alpha(t) = |t| + (2\alpha - 1)t$.
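
The loss $\ell_\alpha$ is the usual quantile ("check") loss up to a constant factor: for $\alpha = 0.5$ it reduces to $|t|$, and minimising it over a constant yields the $\alpha$-quantile of the data. The short Python sketch below illustrates this under simplifying assumptions; the helper `best_constant` is hypothetical and searches only among the observed values.

```python
import numpy as np

def check_loss(t, alpha):
    """Loss l_alpha(t) = |t| + (2 * alpha - 1) * t."""
    t = np.asarray(t, dtype=float)
    return np.abs(t) + (2.0 * alpha - 1.0) * t

def best_constant(y, alpha):
    """Pick the data value c that minimises sum_i l_alpha(y_i - c)."""
    y = np.asarray(y, dtype=float)
    totals = [check_loss(y - c, alpha).sum() for c in y]
    return float(y[int(np.argmin(totals))])

data = [1.0, 2.0, 3.0, 4.0, 100.0]
print(best_constant(data, 0.5))  # 3.0, the median
print(best_constant(data, 0.9))  # 100.0, an upper quantile
```

Replacing the squared error by this loss makes the fit robust to outliers such as the value 100 in the example and lets the same local polynomial machinery estimate conditional quantiles, which is what predictive intervals require.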

Then we combine those two local polynomial models to obtain final results. 

3 Experimental Results 

There are three steps of experiments for the investigation of “Exchange Rates
Patterns” between the U.S. dollar and the Australian dollar. The data consist
of daily exchange rates for each business day between 3 January 1994 and 9
August 1999. The time series is plotted in figure 1. All experiments were done on
our Unix system and Windows NT 4.0; the prototype was written in the Awk language.

Modelling DTS: Since the difference between every successive pair of time
points in this DTS is a constant, $t_{i+1} - t_i = c$, we may view the structural
base as a set of vector sequences $X = \{X_1, \ldots, X_m\}$, where
$X_i = (s1, s2, s3, s4, s5, s6, s7, s8, s9)^T$ denotes the 9-dimensional
observation on an object that is to be assigned to a prespecified group. Then
structural pattern searching becomes local linear regression in a multivariate
setting. We may also view the value-point process data as bivariate data
$(X_1, Y_1), \ldots, (X_n, Y_n)$; this is one-dimensional local polynomial
modelling. So the pattern searching problem in a DTS can be formulated as
multilevel local polynomial analysis.

Dealing with Structure: We investigate the sample of the structural base to
test the naturalness of the similarity and periodicity on the Structural Base
distribution. The size of this discrete-valued time series is 1257 points.
We only consider 8 states in the state-space of the structural distribution:
S = {s1, s2, s3, s4, s6, s7, s8, s9}.

Footnote (data source): The Federal Reserve Bank of New York, trade weighted
value of the dollar = index of weighted average exchange value of the U.S. dollar
against the Australian dollar: http://www.frbchi.org/econinfo/finance/finance.html.

Footnote: In this case study, state 5 (i.e., s5) is an empty state.
Fig. 1. 1409 business exchange rates between the U.S. dollar and the Australian dollar.




Fig. 2. Left: plot of the distance between same state for all 8 states in 1257 business 
days. Right: plot of the distance between same state for all 8 states in first 50 business 
days. 



In Figure 2, the x-axis represents the distances between patterns and the
y-axis represents different patterns (e.g., the pattern (s1, s2) is represented at
point 12 on the y-axis). This explains two facts: (1) there exists a hidden
periodic distribution which corresponds to patterns on the same line with
different distances, and (2) there exist partial periodic patterns on and between
the same lines. To explain this further, we can look at the plot of distances
between the patterns at a finer granularity over a selected portion of the daily
exchange rates. For example, in the right of Figure 2 the dataset consists of
daily exchange rates for the first 50 business days, with each point representing
the distance of patterns of various forms. Between some combined pattern classes
there also exist some similar patterns.

In summary, some results for the structural base experiments are as follows: 
- Structural distribution is a hidden periodic distribution with a periodic
length function $f(t)$ (there are techniques available to approximate the
form of this function, such as higher-order polynomial functions).

- There exist some partial periodic patterns based on a distance shifting.

- For all kinds of distance functions there exists a cubic curve:
$y = a + \frac{b}{x} + \frac{c}{x^2}$, where $a$, $c$ and $x > 0$, $b < 0$.

Dealing with Values: We now illustrate our method for analysing and
constructing predictive intervals on the value-point sequence for searching
patterns. The linear regression of the value-point $X_t$ against $X_{t-1}$
explains about 99% of the variability of the data sequence, but it does not help
us much in analysing and predicting future exchange rates. In the light of our
structural base experiments, we have found that the series
$Y_t = X_t - X_{t-2}$ has non-trivial autocorrelation. The correlation between
$Y_t$ and $Y_{t-1}$ is 0.4898. Our analysis is focused on the series $Y_t$, which
is presented in the left of Figure 3 as a scatter plot of lag 2 differences:
$Y_t$ against $Y_{t-1}$. We obtain the exchange rates model according to
nonparametric quantile regression theory:

$$Y_t = 0.4732\, Y_{t-1} + \varepsilon_t$$

From the distribution of $\varepsilon_t$, $\varepsilon_t$ can be modelled as an
AR(2) process

$$\varepsilon_t = 0.2752\, \varepsilon_{t-1} - 0.4131\, \varepsilon_{t-2} + e_t$$

with a small $Var(e_t)$ (about 0.00041) to improve the predictive equation.
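
The lag 2 differencing and the resulting one-step forecast can be sketched as follows. The coefficient 0.4732 is the one reported in the text; the function names, the use of NumPy and the toy input values are assumptions made for illustration only and are not the authors' Awk prototype or their data.

```python
import numpy as np

PHI = 0.4732  # coefficient reported for Y_t = PHI * Y_{t-1} + eps_t

def lag2_differences(x):
    """Series Y_t = X_t - X_{t-2}."""
    x = np.asarray(x, dtype=float)
    return x[2:] - x[:-2]

def lag1_correlation(y):
    """Sample correlation between Y_t and Y_{t-1}."""
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(y[1:], y[:-1])[0, 1])

def predict_next_rate(x, phi=PHI):
    """One-step forecast of X_{t+1}.

    Since Y_{t+1} = X_{t+1} - X_{t-1} and Y_{t+1} is approximated by
    phi * Y_t, the forecast is X_{t-1} + phi * (X_t - X_{t-2}).
    """
    y_t = x[-1] - x[-3]
    return x[-2] + phi * y_t

rates = [0.70, 0.71, 0.72, 0.74]  # made-up values, not the real exchange rates
print(lag2_differences(rates))
print(predict_next_rate(rates))
```

Working with $Y_t$ rather than $X_t$ removes most of the level of the series, so the fitted coefficient describes the short-term dynamics instead of the trivial fact that today's rate is close to yesterday's.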

For prediction of future exchange rates for the next 150 business days, we
use the simple equation $Y_t = 0.4732\, Y_{t-1}$ with an average error of
-0.000057. In the right of Figure 3 the actually observed series and the
predicted series are shown. Some results for the value-point experiments are as
follows:




Fig. 3. Left: Scatter plot of lag 2 differences: $Y_t$ against $Y_{t-1}$. Right:
Plot of future exchange rates only for 150 business days by using the simple
equation $Y_t = 0.4732\, Y_{t-1}$.
- There does not exist any full periodic pattern, but there exist some partial
periodic patterns based on a distance shifting.

- There exist some similarity patterns with a small distance shifting.

Using Multilevel Local Polynomial Model: By embedding the pure value
results of the local polynomial model into the structure pattern model, we have
the following combined results on exchange rates:

- There does not exist any full periodic pattern but there exist some partial 
periodic patterns. 

- There exist some similarity patterns with a small distance shifting. 

- Temporal patterns can be predicted only for near future (in a window of a 
few months) based on past data. 

4 Conclusion 

This paper has presented a multilevel-local polynomial model for finding
patterns in discrete-valued time series. The method guarantees finding different
patterns if they exist in the structural and value-point probability
distributions of a real dataset. The results of preliminary experiments are
promising and we are currently applying the method to large realistic data sets.

Abstract. We consider the problem of mining sequential patterns over
several large databases placed at different sites. Experiments carried out
on synthetic data generated within a simulation environment are reported.
We use several agents capable of communicating with a temporal ontology.



1 Introduction 

Much effort has recently been put into trying to turn the data available in the
enormous number of databases around us into useful knowledge [7], among other
means, by:

a multistrategy methodology for conceptual data exploration, by which 
we mean the derivation of high-level concepts and descriptions from data 
through symbolic reasoning involving both data and background knowl- 
edge. 

There are many current research issues in this area including learning over mul- 
tiple databases and the World Wide Web and [8]: 

Optimizing decisions rather than predictions. The goal here is to use
historical data to improve the choice of actions in addition to the more
usual goal of predicting outcomes.

Making rational decisions requires knowledge of the phenomena in the
application domain, and sequential patterns play an important role. Mining
sequential patterns over a large database of customer transactions was at first
aimed at providing valuable information to businesses, such as customer buying
patterns and stock trends [1, 5, 9]. Mining generalized sequential patterns [9]
has been given a more efficient solution [6]. Temporal features in the mining
models or processes can provide accurate information about an evolving business
domain [2], but can also benefit other application areas.

Another approach to dealing with distribution is the deployment of several
agents, possibly specialized in various tasks, working together toward a common
high-level goal [3,10]. We describe a model and operations for the process of
distributed database mining by multi-agents and report some experiments showing
how the basic operations we propose perform on synthetic data.

K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 187-192, 2000. 
2 Information Marketplace Environment

The scenario we have chosen to study the strategy of agents is an information
marketplace with several customers collecting infons that appear and disappear.
Infons are carriers of information characterized by a utility value varying in
the range 1 to 4. The customers have various interests, shown by the utility
value of the infons they collect. A sample of our synthetic world showing
possible changes in the marketplace for the time steps t = 0, 1, 2, 3 is
illustrated in figure 1. All infons appear randomly, one in a square of the grid,
with a life span described by a utility sequence for infon i_k: its utility
increases in time up to the maximum and then decreases again until 1, and in the
next time step it disappears.

Fig. 1. Segment of the information marketplace at times (a) t = 0, (b) t = 1,
(c) t = 2 and (d) t = 3.

2.1 Mining Task 

We assume that we have customer C0 interested in infons of utility 3 and 4,
customer C1 interested just in infons of maximum utility, customer C2 interested
in infons of increasing utility value and customer C3 interested in infons of
decreasing utility. The transactions realized over the time interval [0..3]
by customers C0, C1, C2, C3 in the segment of the information market world
depicted in figure 1 are shown in table 1. The transactions realized over time at
a location of the information market are saved into a log file.
Table 1. Data-sequence database example.

Customer  Time  Items    Customer  Time  Items

(table body not recoverable from the source)
Table 2. Log file example.

Time  Items  Customer
0     i1     Co, Co
1     i1     Co
2     i1     Co
3


Table 2 shows the transactions for the location of infon i1 in figure 1. From
these distributed log files we intend to extract the customers' behavior by
distributed mining.

2.2 Conceptual Agent 

Our agents are based on the following algorithm for discovery of frequent episodes 
in event sequences [4, 5]. This algorithm has two phases, generation of candidates 
and validation of generated candidates: 

Input: a set E of event types, an event sequence S over E,
       a window width Win, and a frequency threshold Min.

Output: The collection R(S, Win, Min) of frequent episodes.

Algorithm:

L := 1; // current length
Result := {}; // frequent episodes found so far
CurrCandidates := generation of candidate episodes with length L;
while (CurrCandidates != {}) do
    CurrResult := validation of CurrCandidates based on Min;
    L := L + 1;
    // Result will contain the frequent episodes for all values of L
    Result := Result U CurrResult;
    CurrCandidates := generation of candidate episodes with length L;
end
output Result;

The algorithm starts from the most general episodes, i.e. episodes with only
one event (L = 1), and continues until it cannot generate new candidates. On
each level the algorithm first computes a collection of candidate episodes, and
then checks their frequencies against the event sequence database. The crucial
point in the candidate generation is given by the following lemma: If an episode
is frequent in an event sequence, then all its sub-episodes are frequent. The
main idea of the validation phase is to move the window Win over the event
sequence S and count the appearances of a candidate. When these appearances
exceed the given threshold Min, the candidate is validated. The algorithm works
with two kinds of episodes:

- serial episodes, denoted S(AB): such an episode occurs in a sequence only if
there are events of types A and B that occur in this order in the sequence.

- parallel episodes, denoted P(AB): no constraints on the relative order of A
and B are given.

The generation phases are the same for both episode types, but the validation
phases are different. Serial candidate episodes are validated by using state
automata that accept the candidate episodes and ignore all other input.
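
The level-wise procedure can be illustrated for serial episodes with the following Python sketch. It is a simplified reading of the algorithm, not the agents' implementation: function names are hypothetical, time stamps are assumed to be integers, the event list is assumed to be non-empty, a plain subsequence test stands in for the state automata, and a candidate is kept when its window count is at least `min_count`.

```python
def sliding_windows(events, win):
    """Yield the event types seen in every window [start, start + win)."""
    first, last = events[0][0], events[-1][0]
    for start in range(first - win + 1, last + 1):
        yield [etype for (t, etype) in events if start <= t < start + win]

def occurs_serially(episode, window):
    """True if the event types of `episode` appear in this order in `window`."""
    remaining = iter(window)
    return all(any(etype == wanted for etype in remaining) for wanted in episode)

def frequent_serial_episodes(events, win, min_count):
    """Level-wise search: validate candidates of length L, then build L + 1."""
    events = sorted(events)
    windows = list(sliding_windows(events, win))
    types = sorted({etype for _, etype in events})
    candidates = [(etype,) for etype in types]  # L = 1
    result = {}
    while candidates:
        # Validation phase: count the windows in which each candidate occurs
        frequent = {}
        for ep in candidates:
            count = sum(occurs_serially(ep, w) for w in windows)
            if count >= min_count:
                frequent[ep] = count
        result.update(frequent)
        # Generation phase: extend an episode only if its suffix is frequent too
        candidates = [ep + (etype,) for ep in frequent for etype in types
                      if ep[1:] + (etype,) in frequent]
    return result

log = [(1, "A"), (2, "B"), (3, "A"), (4, "B")]
print(frequent_serial_episodes(log, win=2, min_count=2))
```

For this small log the call returns `{('A',): 4, ('B',): 4, ('A', 'B'): 2}`. The generation step uses the lemma quoted above: a longer episode is only proposed when its shorter sub-episodes were already found frequent, which keeps the candidate set small.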

3 Multi-Agent System

Our agents perform distributed data mining. Each agent analyses a site and
can cooperate with other agents. Cooperation is possible because the agents use
an any-time algorithm for data mining: the algorithm can be suspended and then
resumed after each phase of length L. Communication between agents is based
on simple messages, formed as a combination of serial and/or parallel episodes
(e.g. P(S(FE)S(BC))). The agents do not transfer large data, only the results of
their mining. The agents must use the same ontology for event symbols. The
agents also keep the episodes extracted from one site and can then use them on
another site. A multi-agent system for data mining avoids the expensive
operation of moving all the data to a single site.

4 Experimental Results 



Table 3. Log file fragment

Time  Infon  Customer    Time  Infon  Customer
   2  A      C4            28  B      C3
  14  F      C3            29  C      C1, C3
  15  B      C1            30  A      C4
  16  C      C4            33  E      C3
  17  A      C4            36  B      C1

Our experiments consist of two phases. First we generated synthetic data
within a simulated environment and then we used data mining agents to analyze
the data. In the data generation phase we used the information marketplace
environment with 80×80 grid dimensions. The number of infons remains constant all
the time. When an infon's utility becomes zero, it moves randomly to another free
grid location. We defined four kinds of customer behavior, C1, C2, C3, C4, as
combinations of serial and/or parallel episodes. These behaviors are shown in the
lines labelled “Generated” in table 4, where A = (4, |), B = (2, |), C = (3, |),
D = (3,1), E = (1,1), F = (l,t) and a pair (V,T) represents an infon with
value V and evolution trend T. While executing their behavior, the customers
access different grid locations. For each grid location we generated a log file.
A log file fragment is shown in table 3. These distributed log files are the
inputs for the data mining agents. Table 4 shows what the data mining agents have
extracted from three different log files.



Table 4. Types of behavior

Behavior  Behavior description (generated and extracted)
type

C1  Generated: S(BAC)
C1  Log1: P(C), P(A), P(B), S(C), S(A), S(B), S(CB), S(AB)
C1  Log2: P(C), P(A), P(B), S(C), S(A), S(B), S(CC), S(CA), S(CCC),
          S(CCA), S(CCCA)
C1  Log3: P(C), P(A), P(B), S(C), S(A), S(B), S(AA)

C2  Generated: S(CAD)
C2  Log1: P(C), P(A), P(D), P(CA), P(CD), P(AD), S(C), S(A), S(D),
          S(CD), S(AD)
C2  Log2: P(C), P(A), P(D), P(CA), P(AD), S(C), S(A), S(D)
C2  Log3: P(C), P(D), P(CD), S(C), S(A), S(D), S(CC), S(CD), S(AD)
C2  Log3: P(C), P(D), P(CA), S(C), S(A), S(D), S(CC), S(CD), S(AD)

C3  Generated: P(S(FE)S(BC))
C3  Log1: P(B), P(C), P(E), P(F), P(BC), P(BE), P(CE), P(EF), P(BCE),
          S(B), S(C), S(E), S(F)
C3  Log2: P(B), P(C), P(E), P(F), P(BC), P(BE), P(CE), P(EF), P(BCE),
          S(B), S(C), S(E), S(F)
C3  Log3: P(B), P(C), P(E), P(F), P(CE), P(EF), S(B), S(C), S(E), S(F)
C3  Log3: P(B), P(C), P(E), P(F), P(BC), P(BE), P(CE), P(EF), S(B), S(C),
          S(E), S(F)

C4  Generated: P(CA)
C4  Log1: P(C), P(A), P(CA), S(C), S(A), S(CC)
C4  Log2: P(C), P(A), P(AC), S(C), S(A), S(AA), S(AC), S(AAA)
C4  Log3: P(C), P(A), P(CA), S(C), S(A), S(AA)

Only customer behavior C4 has been completely extracted. The other behaviors
have been partially extracted. At the local level the extracted behaviors are
correct, but at the information marketplace level these behaviors are incomplete.
Because the local information is incomplete, a cooperation mechanism between
agents is necessary.

5 Conclusions 

It has been shown that meta-learning can improve accuracy while lowering per
transaction losses [10]. Therefore, distributing the mining can significantly
reduce communication and contribute to the scaling up of various methods.
Intelligent agents should be able to take advantage of higher-level communication
on their common mining tasks by using an appropriate ontology. This is our next
step in the search for more flexible mining on distributed sites.

Abstract - A hybrid nonlinear time series predictor that consists of a nonlinear
sub-predictor (NSP) and a linear sub-predictor (LSP) combined in a cascade form is
proposed. A multilayer neural network is employed as the NSP and the algorithm
used to update the NSP weights is the Lyapunov stability-based backpropagation
algorithm (LABP). The NSP can predict the nonlinearity of the input time series.
The NSP prediction error is then further compensated by employing an LSP. Weights
of the LSP are adaptively adjusted by the Lyapunov adaptive algorithm. Signals'
stochastic properties are not required and the error dynamic stability is
guaranteed by the Lyapunov theory. The design of this hybrid predictor is
simplified compared to existing hybrid or cascade neural predictors [1]-[2]. It
offers fast convergence and lower computational complexity. The theoretical
prediction mechanism of this hybrid predictor is further confirmed by simulation
examples for real world data.

1 Introduction 

Financial forecasting is an example of a signal processing problem which is
challenging due to small sample sizes, high noise, non-stationarity and
non-linearity. Neural networks (NN) have been very successful in a number of
signal processing applications including the prediction of real-world data, e.g.
sunspot time series [5] and financial forecasting [6]. Many types of NN structures
have been introduced in [7], which is a good reference for time series analysis
and prediction. In practice, many time series include both nonlinear and linear
properties and the amplitude of the time series is usually continuous. Therefore
it is useful to use a combined structure of linear and nonlinear models to deal
with such signals.

In this paper, we propose a hybrid predictor with Lyapunov adaptive algorithms. It 
consists of the following sub-predictors: (1) A NSP, which consists of a multilayer 
neural network (MLNN) with a nonlinear hidden layer and a linear output neuron. The 
algorithm used to update the weights is LABP [8]. (2) A LSP, which is a conventional 
finite-impulse-response (FIR) filter. Its weights are adaptively adjusted by the 
Lyapunov adaptive algorithm [9]. The NSP that includes nonlinear functions can 
predict the nonlinearity of the input time series. However the actual time series 
contains both linear and nonlinear properties, hence the prediction is not complete in 
some cases. Therefore the NSP prediction error is further compensated for by 
employing a LSP after the NSP. In this paper the prediction mechanism and the role of 
the NSP and LSP are theoretically and experimentally analyzed. The role of the NSP is 
to predict the nonlinear and some part of the linear property of the time series. The 
LSP works to predict the NSP prediction error. Lyapunov functions are defined for 
these prediction errors so that they converge to zero asymptotically. The signals' 
stochastic properties are not required and the error dynamic stability is guaranteed by 
the Lyapunov Theory. The design of this hybrid predictor is simplified compared to
existing hybrid or cascade neural predictors [1]-[2]. It offers fast convergence
and lower computational complexity. Furthermore, the predictability of the hybrid
predictor for noisy time series is investigated. Computer simulations using the
nonlinear sunspot time series and other conventional predictor models are
demonstrated. The theoretical analysis of the predictor mechanism is confirmed
through these simulations.

2 A Hybrid Structure of Neural Network-FIR Predictor

Figure 1 illustrates the proposed hybrid predictor structure, which is the cascade
form of an MLNN and an FIR filter. The actual time series contains both linear and
nonlinear properties and its amplitude is usually a continuous value. For these
reasons, we combine nonlinear and linear predictors in a cascade form. The
nonlinear prediction problem can be described as follows: a set of the past
samples x(n-1), ..., x(n-N) is transformed into the output, which is the
prediction of the next coming sample x(n).

Therefore we employ an MLNN, called the NSP, in the first stage. It consists of a
sigmoidal hidden layer and a single output neuron. The NSP is trained by the
supervised learning algorithm called LABP [8]. This means the NSP itself acts as a
single nonlinear predictor.

In reality it is rather difficult to generate the continuous amplitude and to
predict the linear property. Hence a linear predictor is employed after the NSP to
compensate for the linear relation between the input samples and the target. An
FIR filter is used for this purpose, which will be called the Linear Sub-predictor
(LSP). The LSP is trained by the Lyapunov theory-based adaptive filtering
algorithm (LAF) [9]. The same target, or desired time series, is used for both the
NSP and the LSP. Hence the nonlinear and some part of the linear properties of the
input signal can be predicted by the NSP and the remaining part is predicted by
the LSP. The current sample, x(n), is used as the desired response, d(n), for both
the NSP and the LSP.




Fig. 1. Structure of the hybrid predictor



3 Nonlinear Sub-Predictor (NSP) 

The architecture of the MLNN considered is shown in Figure 1. For convenience
of presentation, we describe the LABP algorithm with the assumption that L = 2,
implying one hidden layer. The algorithm can be extended easily to cases when
L > 2. The input x(n) is a sampled signal: $x(n) = \{x(n-1), \ldots, x(n-N)\}$ or
$x(n) = \{x_{n-1}, \ldots, x_{n-N}\}$ and the output is a scalar $y_{NSP}(n)$. The
purpose of this neural network is to adjust the neural weights so that the error
between the network output $y_{NSP}(n)$ and the desired output $d(n)$ converges to
zero asymptotically. Let $W_{ji}^{(1)}(n)$ denote the connection weight between
the $i$th neuron in the input layer, l = 0, and the $j$th neuron in the hidden
layer, l = 1 (for i = 1, 2, ..., N; j = 1, 2, ..., M). Let $S_j(n)$ and
$F_j(\cdot)$ be the output and the activation function of the $j$th neuron in the
hidden layer, respectively. $W_{1j}^{(2)}(n)$ denotes the connection weight
between the $j$th neuron in the hidden layer and the neuron in the output layer
y(n), l = 2. Then we have the following system equations:



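$$y_{NSP}(n) = \sum_{j=1}^{M} W_{1j}^{(2)}(n)\, S_j(n) \qquad (3.1)$$

$$S_j(n) = F_j\Big( \sum_{i=1}^{N} W_{ji}^{(1)}(n)\, x_i(n) \Big) \qquad (3.2)$$

where j = 1, 2, ..., M and i = 1, 2, ..., N.
Substituting (3.2) into (3.1) gives

$$y_{NSP}(n) = \sum_{j=1}^{M} W_{1j}^{(2)}(n)\, F_j\Big( \sum_{i=1}^{N} W_{ji}^{(1)}(n)\, x_i(n) \Big) \qquad (3.3)$$

where

$$F(\cdot) = \frac{1}{1 + e^{-(\cdot)}}$$

The prediction error for the NSP is computed as follows:

$$e_{NSP}(n) = d(n) - y_{NSP}(n) \qquad (3.4)$$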
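
Equations (3.1) to (3.4) amount to one forward pass through a single-hidden-layer network followed by an error computation. The sketch below shows this with NumPy; the array shapes, the function names and the choice of the logistic sigmoid as activation are assumptions for illustration, and the weight update itself (LABP) is not included.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, assumed here for the hidden layer."""
    return 1.0 / (1.0 + np.exp(-z))

def nsp_forward(x, W1, W2):
    """Forward pass of the NSP.

    x  : past samples, shape (N,)
    W1 : input-to-hidden weights, shape (M, N)
    W2 : hidden-to-output weights, shape (M,)
    Returns the scalar output y_NSP(n) and the hidden outputs S(n).
    """
    S = sigmoid(W1 @ x)   # (3.2): hidden neuron outputs
    y = float(W2 @ S)     # (3.1): linear output neuron
    return y, S

def nsp_error(d, y):
    """Prediction error (3.4)."""
    return d - y

x = np.zeros(3)
W1 = np.ones((2, 3))
W2 = np.array([1.0, 2.0])
y, S = nsp_forward(x, W1, W2)
print(y, nsp_error(2.0, y))  # 1.5 0.5
```

With a zero input every hidden neuron outputs 0.5, so the output is simply half the sum of the output weights, which makes the example easy to check by hand.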

The weight vectors of the MLNN can be updated using the following expressions:

$$W_{ji}^{(1)}(n) = W_{ji}^{(1)}(n-1) + \Delta W_{ji}^{(1)}(n) \qquad (3.5)$$

and

$$W_{1j}^{(2)}(n) = W_{1j}^{(2)}(n-1) + \Delta W_{1j}^{(2)}(n) \qquad (3.6)$$

where

$$\Delta W_{ji}^{(1)}(n) = -W_{ji}^{(1)}(n-1) + \frac{1}{N x_i(n)}\, g_j(u(n)) \qquad (3.7)$$

$$\Delta W_{1j}^{(2)}(n) = \frac{1}{M S_j(n)} \Big[ d(n) - \sum_{j=1}^{M} W_{1j}^{(2)}(n)\, S_j(n) \Big] \qquad (3.8)$$

$$u(n) = \frac{1}{M W_{1j}^{(2)}(n)}\, d(n), \qquad g_j(\cdot) = F_j^{-1}(\cdot) \qquad (3.9)$$

The detailed derivation and design of the LABP algorithm can be found in [8].

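To prevent the singularity problem due to zero values of $x_i(n)$ and
$W_{1j}^{(2)}(n)$, the weight update laws $\Delta W_{ji}^{(1)}(n)$ and
$\Delta W_{1j}^{(2)}(n)$ in (3.7) and (3.8) can be modified as follows:

$$\Delta W_{ji}^{(1)}(n) = -W_{ji}^{(1)}(n-1) + \frac{1}{N x_i(n) + \lambda_1}\, g_j(u(n)) \qquad (3.10)$$

$$\Delta W_{1j}^{(2)}(n) = \frac{1}{M S_j(n)} \Big[ d(n) - \sum_{j=1}^{M} W_{1j}^{(2)}(n)\, S_j(n) \Big] \qquad (3.11)$$

where

$$u(n) = \frac{1}{M W_{1j}^{(2)}(n) + \lambda_2}\, d(n) \qquad (3.12)$$

Smaller values of $\lambda_1$ and $\lambda_2$ contribute to a smaller error
$e_{NSP}(n)$.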

4 Linear Sub-Predictor (LSP)

The linear sub-predictor (LSP) consists of a conventional finite-impulse-response
(FIR) filter. It can be characterized by the difference equation

$$y_{LSP}(n) = \sum_{i=0}^{K-1} h_i(n)\, y_{NSP}(n-i) \qquad (4.1)$$

The difference equation in (4.1) can be rewritten in vector form as

$$y_{LSP}(n) = H^T(n)\, Y_{NSP}(n) \qquad (4.2)$$

where $H(n) = [h_0(n), h_1(n), \ldots, h_{K-1}(n)]^T$ and
$Y_{NSP}(n) = [y_{NSP}(n), y_{NSP}(n-1), \ldots, y_{NSP}(n-K+1)]^T$.

The LSP's coefficient vector is updated by the LAF algorithm [11]:

$$H(n) = H(n-1) + g(n)\, \alpha(n) \qquad (4.3)$$

where $g(n)$ is the adaptation gain and $\alpha(n)$ is the a priori estimation
error defined as

$$\alpha(n) = d(n) - H^T(n-1)\, Y_{NSP}(n) \qquad (4.4)$$

The adaptation gain $g(n)$ in (4.3) is adaptively adjusted using Lyapunov
stability theory as in (4.5), so that the error $e_{LSP}(n) = d(n) - y_{LSP}(n)$
asymptotically converges to zero.



$$g(n) = \frac{Y_{NSP}(n)}{\|Y_{NSP}(n)\|^2}\left[1 - \kappa\,\frac{|e_{LSP}(n-1)|}{|\alpha(n)|}\right] \qquad (4.5)$$

where $0 < \kappa < 1$. To avoid singularities, (4.5) can be modified as

$$g(n) = \frac{Y_{NSP}(n)}{\lambda_3 + \|Y_{NSP}(n)\|^2}\left[1 - \kappa\,\frac{|e_{LSP}(n-1)|}{\lambda_4 + |\alpha(n)|}\right] \qquad (4.6)$$

where $\lambda_3$ and $\lambda_4$ are small positive numbers and $0 < \kappa < 1$. According to [10], when the
adaptive gain $g(n)$ is adjusted using (4.6), the prediction error $e_{LSP}(n)$ does not
converge to zero but to a ball centred at the origin, whose radius depends on the values of
$\lambda_3$ and $\lambda_4$. Again, the convergence rate is affected by the choice of $\kappa$.
Generally, the smaller $\lambda_3$ and $\lambda_4$ are, the smaller the error $e_{LSP}(n)$ is.
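
The update in (4.3), (4.4) and (4.6) can be sketched as a single step in Python. This is only a minimal illustration, assuming the gain has the form $g(n) = Y_{NSP}(n)\,[1 - \kappa |e_{LSP}(n-1)| / (\lambda_4 + |\alpha(n)|)] / (\lambda_3 + \|Y_{NSP}(n)\|^2)$; the names `laf_step`, `kappa`, `lam3` and `lam4` are hypothetical and do not come from the paper.

```python
import numpy as np

def laf_step(h, y_vec, d, e_prev, kappa=0.5, lam3=1e-6, lam4=1e-6):
    """One coefficient update of a K-tap FIR sub-predictor (illustrative sketch).

    h      : current coefficient vector H(n-1), shape (K,)
    y_vec  : [y_NSP(n), ..., y_NSP(n-K+1)], shape (K,)
    d      : desired value d(n)
    e_prev : previous a posteriori error e_LSP(n-1)
    """
    alpha = d - h @ y_vec                                  # a priori error, as in (4.4)
    scale = 1.0 - kappa * abs(e_prev) / (lam4 + abs(alpha))
    gain = y_vec / (lam3 + y_vec @ y_vec) * scale          # regularised gain, as in (4.6)
    h_new = h + gain * alpha                               # coefficient update, as in (4.3)
    e_post = d - h_new @ y_vec                             # a posteriori error e_LSP(n)
    return h_new, e_post
```

Under these assumptions, setting `lam3 = lam4 = 0` gives |e_LSP(n)| = kappa * |e_LSP(n-1)|, so the error shrinks geometrically; with positive `lam3` and `lam4` the error settles in a small neighbourhood of zero instead.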



Prediction Analysis: The role of the LSP is to predict the prediction error caused by the
NSP [3], [4]. The analysis can be summarized as follows. Expression (3.4) can be
rewritten as

$$y_{NSP}(n) = d(n) - e_{NSP}(n) \qquad (4.8)$$

Because the LSP is an FIR structure with $K$ taps, its output $y_{LSP}(n)$ can be expressed as

$$y_{LSP}(n) = h_0\, y_{NSP}(n) + h_1\, y_{NSP}(n-1) + \ldots + h_{K-1}\, y_{NSP}(n-K+1) \qquad (4.9)$$

Substituting (4.8) into (4.9) gives

$$y_{LSP}(n) = h_0\,(d(n) - e_{NSP}(n)) + h_1\, y_{NSP}(n-1) + \ldots + h_{K-1}\, y_{NSP}(n-K+1) \qquad (4.10)$$

$$= h_0\, d(n) + [-h_0\, e_{NSP}(n) + h_1\, y_{NSP}(n-1) + \ldots + h_{K-1}\, y_{NSP}(n-K+1)]$$

Let

$$y^*(n) = h_1\, y_{NSP}(n-1) + \ldots + h_{K-1}\, y_{NSP}(n-K+1) \qquad (4.11)$$

With the assumption that $h_0 \approx 1$, expression (4.10) can be rewritten as

$$y_{LSP}(n) = d(n) - [e_{NSP}(n) - y^*(n)] \qquad (4.12)$$







Therefore the final prediction error can be expressed as

$$e_{final} = e_{LSP}(n) = d(n) - y_{LSP}(n) = e_{NSP}(n) - y^*(n) \qquad (4.13)$$

Hence, the function of the LSP is to predict the prediction error resulting from the NSP.
The contributions of the NSP and the LSP to the overall performance of the proposed
hybrid predictor can be measured by the ratio

$$R = P_{NSP} / P_{LSP} \qquad (4.14)$$

where $P_{NSP}$ and $P_{LSP}$ are the powers of the NSP output and the LSP output, respectively.
The normalized root mean square error (NRMSE) [3], [4] is used to express the
prediction errors so that they can be compared. It is calculated as

$$\mathrm{NRMSE} = \sqrt{\mathrm{MSE} / \sigma^2} \qquad (4.15)$$

where MSE is the mean squared error of the NSP or the LSP and $\sigma^2$ is the input signal power.
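
The two measures in (4.14) and (4.15) can be computed as in the following sketch. It assumes that "power" means the mean of the squared samples and that the NRMSE is the square root of the MSE divided by the input signal power; the function names are illustrative only.

```python
import numpy as np

def signal_power(x):
    """Mean of the squared samples (assumed definition of signal power)."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(x ** 2))

def power_ratio(y_nsp, y_lsp):
    """Ratio R of NSP output power to LSP output power, as in (4.14)."""
    return signal_power(y_nsp) / signal_power(y_lsp)

def nrmse(desired, predicted, input_signal):
    """Root of the mean squared error normalised by the input signal power, as in (4.15)."""
    desired = np.asarray(desired, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mse = float(np.mean((desired - predicted) ** 2))
    return float(np.sqrt(mse / signal_power(input_signal)))
```

Normalising by the input power makes errors comparable across series with different amplitudes, which is why the same quantity can be reported for both the NSP and the LSP.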

5 Simulation Results Using Hybrid Model

Nonlinear Time Series - Simulations have been carried out for one-step-ahead prediction
of two examples: sunspot data and chaotic data. The sunspot data have been used as a
benchmark by researchers for many years. The data file of the sunspot time series was
downloaded from [11]. It consists of the sunspot data from the year 1700 to 1999 (300 samples).
Fig. 2 shows the plot of the sunspot time series. Fig. 3 illustrates the output of the
hybrid predictor for the sunspot time series (1950-1999). Figs. 4 and 5 show the squared
prediction errors of the NSP, $e_{NSP}^2(n)$, and the LSP, $e_{LSP}^2(n)$, respectively.

Comparison With Other Models - In this section, the prediction performance of the
proposed hybrid predictor, a linear FIR predictor and a nonlinear MLNN predictor
with a linear output neuron are compared for the sunspot time series. A comparison
of different kinds of predictor was demonstrated in [3]. The simulation results
for the sunspot time series are tabulated in Table 1. Compared to the other models, the
proposed hybrid predictor has the minimum prediction errors in both cases. The linear
predictor does not perform well because of the high nonlinearity of the time series.

6 Conclusion

A hybrid nonlinear time series predictor that consists of the NSP and the LSP combined
in cascade form is proposed. The NSP is an MLNN, and the algorithm used to update
the NSP weights is the Lyapunov stability-based backpropagation algorithm (LABP). The
nonlinearity and some part of the linearity of the input time series are predicted by the NSP.
The LSP then further predicts the NSP prediction error. The weights of the LSP are
adaptively adjusted by the LAF. Lyapunov functions are defined for these prediction
errors, which are designed to converge to zero asymptotically. The signals' stochastic
properties are not required, and the stability of the error dynamics is guaranteed by
Lyapunov theory. The design of this hybrid predictor is simpler than that of
existing hybrid or cascade neural predictors. It converges quickly and has lower computational
complexity. Predictability for noisy time series is also investigated. The properties of
these predictors are analyzed taking the nonlinearity of the time series into account.
Hence the prediction mechanism and the roles of the NSP and the LSP in the hybrid
predictor have been analyzed and clarified both theoretically and experimentally.
... using the proposed hybrid predictor (1950-1999)
Legend: original data; '*' predictor output data

Model                          NRMSE
Proposed Hybrid Predictor      4.6 x 10^-? (NRMSE of LSP)
                               0.091 (NRMSE of NSP)
MLNN Predictor (BP)            0.092
Linear FIR predictor (LMS)     0.2897

Table 1: Comparison of NRMSE among different models for sunspot data
Abstract. For many practical applications, such as planning for satellite orbits
and space missions, it is important to estimate the future values of the sunspot
numbers. Numerous methods have been used for this particular case of
time series prediction, including, recently, neural networks. In this paper we
present a genetic programming technique applied to sunspot series prediction.
The paper investigates practical solutions and heuristics for an effective choice of
parameters and functions of genetic programming. The results predict that
the maximum of the smoothed monthly sunspot numbers in the current cycle
is 164 ± 20, and that the next cycle maximum is 162 ± 20, at the 95%
level of confidence. These results are discussed and compared with other
predictions.



1 Introduction 

Planning high frequency communication links and space-related activities undertaken
by NASA and other organizations requires prediction of solar activity and, in
particular, estimates of the sunspot numbers for the coming years.

The sun undergoes periods of high activity, when many solar flares and coronal
mass ejections take place, followed by relatively quiet periods of low activity. One
way to investigate solar activity is to measure systematically the number of sunspots -
during the active periods this number is high and goes down again later (the cycle lasts
about 11 years).

Sunspots also affect conditions on the Earth; for example, they may disrupt
radio communication. Possibly, there is a relation between sunspots and the Earth's
climate.

The characteristics of the sunspot number series are similar to those frequently 
occurring in financial and economic models (they contain sudden irregular variations, 
cycles and trends), therefore the solution for this problem extends beyond 
astrophysics, radio communication and space services. 

Time series prediction is one of the most prevalent functions of any serious
business operation. The model used here is the univariate model, which uses the past
series to predict the future values. Unfortunately, univariate models omit causal
variables, and the prediction is based solely on the pattern discovered in historical
data (some statisticians are quite sceptical about certain practices used in prediction;
for example, Moroney* calls one chapter of his book “Prediction and Fortune
Telling”). Since the phenomenon of sunspots is not entirely understood, it seems that
there is some justification in the choice of this model. A great number of methods
exist for forecasting generally, and some were specifically developed for the sunspot
number series [3].

In this paper, we discuss genetic programming, a variation of evolutionary 
computation, applied to this type of prediction. Genetic programming, developed by 
John Koza [4] is inspired by evolution, and operates on a population of individuals, 
which in this case are computer programs. These individuals participate in a 
simulated evolutionary process and, as a result of it, an optimal or near optimal 
solution emerges. 

The program used in this investigation has been written in C++ and uses C-like 
syntax trees (i.e. programs) to represent solutions (i.e. individuals). The initial 
population consists of programs that are randomly generated syntax trees. These 
programs are created as hierarchical compositions of primitives - non-terminals (or 
functions) and terminals (variables or constants). A node on the syntax tree represents 
a terminal or a function. The selection of primitives is a very significant step in 
genetic programming. The set of primitives may include arithmetic and Boolean 
operators, mathematical functions, conditional and iterative operators and any other 
functions predefined by the user. 

Evolutionary computing is a repetitive process of transforming populations of 
programs by means of reproduction, crossover and mutation. As a termination 
condition, the number of generations can be chosen and the best program that appears 
in any generation is the solution. 

2 Representation of the Time Series Prediction Problem for 
Genetic Programming 

In Figure 1 an example of an evolved program as a solution for the sunspot number 
series prediction is shown (exactly in the form as it was generated by our software). 
GPvalue = // value of the following program: 

((w[3])+((((((w[227])/((w[118])+(w[161])))/(-37.1942))/(((( 
(w[258])/(((((((((((((((((((w[120])+((w[326])- 
w[78])))/((w[106])*(w[95])))/((w[106])»=(w[95])))+ 

((w[140])+(w[8])))»=(w[16]))* 

(w[51]))+((w[263])+(w[16])))-((w[210])+(w[116]))) 

-((w[316])-(w[197])))+(((w[40])/((((((((((((w[71])-(( 

w[100])-((((((w[46])-(w[109]))+((w[206])-(w[286]))) 

/(w[193]))*(((((w[105])+((w[140])+ 

(w[8])))+((w[37])*(w[260])))/((w[37])-(w[84])))- 

(w[91])))-((w[276])-(w[333])))))*((w[255])+(w[134] 

)))-(((w[7])/(w[99]))*(((w[16]) 

+(w[50]))/((w[224])=^(((((w[300])*(w[245]))/(( 

w[153])-(w[173])))+((w[52])/(w[281])))+((w[37]) 



* Moroney, M.J. “Facts from Figures”, Penguin Books, 1964
/(w[166]))))))H(w[289])*(w[41])))+((w[326])-(

w[78])))/((w[226])*(w[63])))+((w[173])/(w[30]))) 

+w[68])+((((w[75])*(w[109]))/((w[131])/(w[99]) 

))/((w[118])+(w[161]))))))+(((w[73])-(w[78]))+( 

w[129])))-((w[16])+(w[50])))/((w[75])+(w[245])))) 

-(w[306])))*(((w[131])/(w[225]))+( w[6])))/((w[4]) 

+(w[121])))/((w[194])+(w[276])))*((w[49])+( 

w[173])))/((w[254])-(w[158])))*((w[71])/(w[243]))) 

/(w[225]))+(w[263]))H(w[16])+(w[245])))/((w[75])+ 

(w[245])))/(((w[134])+(w[258]))*(w[109]))))-( 

(w[16])+((w[16])+(w[50]))))/(58.37))) 

; // end of the Gpvalue 

Fig. 1. A sample solution - a program that returns predicted sunspot number 

Each program is an expression which, when evaluated, returns a real number - a
sunspot number (called GPvalue in Figure 1). The excessive number of brackets
makes these programs hard to read, but they are legitimate C code and can
be included in an ANSI C or C++ program. The variables, the operators and
constants will be explained shortly.

The time series data is a sequence of numbers n[t_1], n[t_2], ..., n[t_i], ..., n[t_N], and
usually no additional information about the meaning of the series or the best
prediction period lag is available. The aim is to find the best prediction of the unseen
continuation of the sequence. We will attempt to evolve the solution
using a certain number (smaller than lag) of values directly preceding time t_j, that
is, a “window” w[t_{j-lag}], ..., w[t_{j-1}], with j going through a part or the whole of the series (the
result presented in Figure 1 is an expression of elements w[t_i] of such a window).

We will seek a solution in the form of a polynomial of high degree; therefore the
operators used are the standard arithmetic operators “+”, “-”, “*” and “/” (slightly
modified to ensure the closure property, for example by allowing division by 0).
Experiments with other operators, such as the trigonometric functions sin and cos, and
if-then, did not contribute anything better (after all, it is possible to find a high-degree
polynomial that fits any data perfectly).
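
The closure property can be illustrated with a small expression-tree evaluator. This is a sketch under stated assumptions, not the authors' C++ implementation: the tuple encoding of nodes is hypothetical, and returning 1.0 when the denominator is zero is one common convention for "protected" division, assumed here because the paper does not say which value it uses.

```python
import operator

def protected_div(a, b):
    """Division that never fails; the fallback value 1.0 is an assumption."""
    return a / b if b != 0 else 1.0

# Function set: the four arithmetic operators, with division protected.
OPS = {
    '+': operator.add,
    '-': operator.sub,
    '*': operator.mul,
    '/': protected_div,
}

def evaluate(node, w):
    """Evaluate a syntax tree against a window w.

    Node encoding (hypothetical):
      ('w', i)        -> window element w[i]
      ('c', value)    -> real constant
      (op, left, right) with op in OPS -> binary operation
    """
    tag = node[0]
    if tag == 'w':
        return w[node[1]]
    if tag == 'c':
        return node[1]
    return OPS[tag](evaluate(node[1], w), evaluate(node[2], w))
```

For instance, the opening fragment of Figure 1, w[3] + (w[227] / (w[118] + w[161])) / (-37.1942), would be encoded as `('+', ('w', 3), ('/', ('/', ('w', 227), ('+', ('w', 118), ('w', 161))), ('c', -37.1942)))`. Because every operator returns a real number for any real inputs, any randomly generated or recombined tree remains a valid program.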

The set of terminals also includes constants, initially real numbers generated
randomly. In the process of evolution, mutation may change them.

The parameter lag, which determines the period of time taken into consideration for a
single prediction, was chosen after many experiments as lag = 140, which for
monthly data corresponds roughly to one cycle. This means that the prediction of a value
at time t_n is calculated using values selected from times t_{n-140}, t_{n-139}, ..., t_{n-1}. Note that
not all the values have to be present, and some may be used many times. Strictly
speaking, this parameter is not particularly important: solutions with a big lag would
include more variables, many of them irrelevant.
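
The windowing described above can be sketched as follows. The function name and the convention that the oldest value comes first in the window are assumptions for illustration; the paper does not state how its window elements w[...] are indexed.

```python
def make_windows(series, lag):
    """Return (window, target) pairs for one-step-ahead prediction.

    Each window holds the `lag` values directly preceding the target,
    ordered from oldest to newest (an assumed convention).
    """
    return [(series[j - lag:j], series[j]) for j in range(lag, len(series))]
```

With monthly data and lag = 140, each training case would therefore pair roughly one solar cycle of history with the value of the following month.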

3 Data 

Fig. 2. Test prediction of cycles 21-22

The history of observing sunspots is very long, but since the invention of the telescope
quite systematic records have been kept. There is a commonly accepted definition of an
average sunspot grouping and observing conditions. The available data goes back as
far as 1750 but, as has been shown, the earlier records are not very reliable [2]. We
will follow other researchers and use observations starting from 1850. The data used
here are monthly actual and monthly smoothed sunspot numbers, available for
example from the Sunspot Index Data Centre at the Royal Observatory of Belgium.
Figure 2 shows a small part of the actual data. The smoothed data used here is the
so-called “13-month running mean”. The smoothing for month n is obtained from the
following formula:



$$\bar{R}_n = \frac{1}{24} \sum_{i=-6}^{5} R_{n+i} + \frac{1}{24} \sum_{i=-5}^{6} R_{n+i}$$

where $R_n$ is the actual value for month n.
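
As a small illustration, the 13-month running mean can be computed as below, assuming the usual form of the formula: the average of two 12-month sums, one covering months n-6 to n+5 and the other months n-5 to n+6, divided by 24. The function name is hypothetical.

```python
def running_mean_13(r, n):
    """13-month running mean of the monthly series r, centred on index n.

    The two end months (n-6 and n+6) are counted once and the eleven
    inner months twice, so the total weight is 24.
    """
    if n < 6 or n + 6 >= len(r):
        raise ValueError("need six months of data on each side of n")
    first = sum(r[n + i] for i in range(-6, 6))    # months n-6 .. n+5
    second = sum(r[n + i] for i in range(-5, 7))   # months n-5 .. n+6
    return (first + second) / 24.0
```

The smoothed value is undefined for the first and last six months of a series, which is one reason predicted values have to be fed back in when recent actual data are not yet available.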

The solutions were evolved here using samples of smoothed data, selected
randomly with a sampling rate of around 10%. For testing predictions, cycles 21 and 22
were used (this corresponds roughly to the period 1975-1999). In many applications of
evolutionary techniques testing is not a necessary step - the fitness indicates the quality
of the solution. Here, however, taking into consideration the random character of the sampling,
the solution was additionally tested each time a “better-so-far” result was obtained.



4 Experiments and Results 



To evolve solutions we used sampling techniques. The fitness of each program was
calculated for three different samples (a greater number of samples considerably
increases the number of evaluations - k samples means, in practice, a k times
longer computing time). Each sample consists of 200 randomly selected points
from the training data set (consisting of around 1500 cases), so the solution was
evaluated 200 times per sample. The final fitness was then taken as the average value over
all samples. This approach, which may be thought of as a version of cross-validation for
big data sets, was chosen to minimise overspecialisation (the overfitting effect).
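
The sampling scheme can be sketched as follows, assuming the fitness on each sample is the mean squared error and that points within a sample are drawn without replacement (the paper does not say which). The names `sampled_fitness`, `program` and `cases` are illustrative.

```python
import random

def sampled_fitness(program, cases, n_samples=3, sample_size=200, rng=random):
    """Average the MSE of `program` over several random samples of `cases`.

    program : callable mapping a window to a predicted value
    cases   : list of (window, target) pairs from the training set
    """
    total = 0.0
    for _ in range(n_samples):
        sample = rng.sample(cases, sample_size)            # drawn without replacement
        squared = sum((program(w) - target) ** 2 for w, target in sample)
        total += squared / sample_size                     # MSE on this sample
    return total / n_samples
```

Because each program sees different random subsets of the roughly 1500 cases, a program that merely memorises a few points is less likely to score well, at the cost of `n_samples * sample_size` evaluations per program.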
Fig. 3. Prediction of cycles 23/24

Overfitting is the major problem we face in time series prediction - genetic
programming is extremely good for symbolic regression (finding a function that fits a
given set of points), but there is no guarantee that the evolved program has the
capability for generalisation. The method described above and used in this work seems
to deal with this problem quite well. Using smoothed data instead of actual sunspot
numbers also helps to fight overspecialisation.

We conducted a number of experiments, trying to establish the best parameters for 
genetic programming to get the best results. Table 1 shows selected results of best ten 
predictions. 

Run       Maximum    Year       Maximum    Year       MAD
          cycle 23              cycle 24
1         179        2001.12    170        2012.02    1.8
2         172        2001.09    183        2012.02    1.8
3         151        2001.02    164        2011.05    3.5
4         197        2001.08    151        2011.01    7.6
5         150        2001.03    152        2012.01    1.7
6         153        2000.07    148        2010.02    2.7
7         155        2000.07    162        2010.02    2.7
8         154        2000.07    148        2010.02    2.7
9         167        2000.05    173        2011.09    2.4
10        167        2000.05    170        2011.09    2.5
Average:  164.5      2000.57    162.1      2011.04    2.94

Table 1. Results of ten runs (maxima of cycles 23 and 24 and Mean Absolute Deviation of the
testing prediction)

These programs have been evolved under the following conditions:

• fitness function: mean squared error (MSE);

• population size: 10 000;

• selection method: 10% of the population selected using the tournament method with
a group size of 4;

• crossover: one-point crossover;

• mutation: terminals, with probability 0.01;

• maximum number of generations: 80;

• lag randomly selected in each experiment from the interval 140 ± n, 0 < n < 70 (a
value between one cycle and one and a half cycles).
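
The selection step listed above can be sketched as follows. This is an illustration only: it assumes that lower fitness (MSE) is better, that each tournament group is drawn without replacement, and that the same individual may win more than one tournament. The function name and arguments are hypothetical.

```python
import random

def tournament_select(fitness, fraction=0.10, group_size=4, rng=random):
    """Return indices of the individuals chosen by repeated tournaments.

    fitness    : list of fitness values (lower is better, e.g. MSE)
    fraction   : share of the population to select
    group_size : number of individuals competing in each tournament
    """
    n_select = max(1, round(len(fitness) * fraction))
    winners = []
    for _ in range(n_select):
        group = rng.sample(range(len(fitness)), group_size)
        winners.append(min(group, key=lambda i: fitness[i]))   # best of the group wins
    return winners
```

A small group size such as 4 keeps selection pressure moderate: weak programs still occasionally win a tournament, which helps preserve diversity in a population of 10 000.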

When necessary, in situations where actual data were not yet available,
the predicted results were smoothed and then used for further prediction.

One particular solution is shown in Figure 1.

Figure 3 shows the prediction (average of ten runs) for cycles 23 and 24.
The maximum of cycle 23, 164.5 ± 20, is expected in July 2000, and the
maximum of cycle 24, 162.1 ± 20, is predicted for January 2011. The average Mean
Absolute Deviation is 2.94. All these results are within the distribution-independent
95% confidence interval (as defined in [5]).

This prediction for the maximum of cycle 23 differs little from the value 154 ± 21 obtained
by Hathaway et al. [3], who performed very comprehensive research using various
classical and specialised techniques. The results obtained from neural networks
(maximum 130 with uncertainty ± 30-80 by Conway et al. [2]) differ more from our
results but, admittedly, less data were at their disposal in 1998. In
addition, in the two reported cases [1, 2] of neural network prediction the annual
mean sunspot numbers were used, simplifying the computation at the cost of
accuracy.

The Solar Cycle Prediction Panel of the Space and Environment Centre issued a
report that predicts the smoothed monthly maximum of Cycle 23 as 159, to occur
between January 2000 and June 2001, which lies within the bounds of our results.



Fig. 4. Prediction of cycle 23 1, 2 and 6 years ahead (plot title: Sunspot prediction of cycle 23; series: 1-year, 2-years, 6 years; horizontal axis: year)
It is possible to employ a different approach to predicting longer intervals. So far
the presented results were obtained from month-by-month prediction, that is, to predict
a sunspot number at time t_n the previous values of the series t_{n-lag}, ..., t_{n-2}, t_{n-1} were used. It is
much harder to obtain good accuracy when predicting for a longer period ahead. In
Figure 4 we show results of prediction 1, 2 and 6 years ahead for cycle 23 and the
beginning of cycle 24 (the expression “x years ahead” means here that, in the process of
training and testing, the data from the preceding 12*x months were excluded from the
process of evolving the solution and prediction).



5 Conclusion 

We have used genetic programming to predict solar activity measured as the
monthly Wolf number. The results compare favourably with other extensive research
in which classical methods and neural networks were used for sunspot number
prediction. Genetic programming is a non-linear, data-driven and adaptive technique
that appears useful for forecasting. The paper proposes practical solutions for the genetic
programming approach to the problem of time series prediction.
Abstract. In fractal analysis of a time series, the relationship between series
length and ruler length may be represented graphically as a Richardson Plot.
Fractal dimension measures can be estimated for particular ranges of ruler length,
supported by the graphical representation.

This paper discusses Richardson Plots which have been obtained for several 
types of time series. From these, patterns have been identified with explanations. 

There is particular focus on local maxima and minima. Significant influences 
found present are described as gradient and vertex effects. The task - and 
implications - of partitioning the range of ruler lengths in determining fractal 
dimension measures is briefly addressed. 

1. Introduction 

Historically, the discovery of mathematical structures that did not fit the patterns of 
Euclid and Newton raised doubts about the appropriateness of the term Exact Sciences 
often associated with mathematics. These patterns and structures caused a revolution 
led by mathematicians such as Cantor, Peano, Von Koch and Sierpinski. Cantor 
conceived a set known as Cantor’s Dust [7], constructed by dividing one-dimensional 
line segments, which seemed to bend the concept of dimension. Nevertheless, it was 
not until the 1960’s when this phenomenon was extensively studied. Mandelbrot 
associated these shapes with forms found in nature, defining them as fractals [9]. 

The observation of fractal shapes in nature spawned considerable work in trying 
to model them with a particular class of mathematical equations. These equations are 
composed of basic elements which, when the equations are iterated, give birth to new 
elements similar in some aspect to the originals. This peculiarity is known as self-similarity
and has been associated with shapes such as the Koch snowflake [4,5,9].
Richardson [10,13] showed how estimates of the lengths of international borders
obtained by the Hausdorff approach [5] were inversely related to the length of the ruler
used. Plotting these estimates against ruler length on a log-log graph - which became 
known as a Richardson Plot - produced a near linear association. Mandelbrot [9,10] 
interpreted the slope of this line to be a measure of the geometric dimensionality of the 
boundary. This value is fractional and lies between 1 and 2. Mandelbrot[9] used the 
term fractal dimension. 
More generally a fractal dimension (FD) attempts to measure an object/figure at
different scales, in order to extract a relation factor from these measurements[2,9]. 
This concept is itself a special case of a general dimension defined by Hausdorff[5], 
which permits the use of this measure (FD) even with sets which may not necessarily 
be self-similar over a wider range of space or time, such as the time series considered 
in this paper. 

Kaye[7] has described how fine particles may be classified by the fractal 
dimension of their rugged boundaries. Kaye also showed how natural boundaries can 
possess two distinct fractal scalings leading to two fractal dimensions. These occur as 
two distinct straight sections on the log-log plot of length estimate against ruler length. 
Kaye used the terms textural fractal dimension (TFD) and structural fractal 
dimension(SFD) to describe the measures associated with short and longer rulers 
respectively. 

Approaches to normalisation vary. Kaye normalised both length and ruler size 
using the maximum Feret’s diameter, while the maximum horizontal distance 
(observed time) of a time series has been used elsewhere to normalise length only 
[6,14]. 

Caution has been advised in the interpretation of a Richardson Plot. For example, 
Avenir [1] referred to data lines fitted at coarse resolution exploration to shapes with a
high aspect ratio as fractal rabbits. Kaye[7,8] also advised a restriction on the size of 
ruler to less than 30% of a maximum Feret’s diameter in exploring a rugged profile, 
and further that if the aspect ratio is greater than 3, any low- value fractal should be 
treated warily. 

2. Method 

Several time series models were studied, including series with a small number of 
vertices and others sampled from negative exponential, random or sinusoidal 
distributions. Each comprised 200 observations. 

For each series a procedure to estimate length from the first to the last value was
followed, after Hausdorff [5]. This may be explained by reference to the graph of the
series, on which all the points were joined in sequence by straight lines. Starting at the
(left) first observed value, an arc of a fixed radius r (the ruler) was drawn. The first
interception of this arc with the joined series became the next position from which an
arc of the same radius was drawn again. This process continued until the 200th observation
was covered. An appropriate measurement (C_r) was taken at the end of the series when
the remaining part was less than a full ruler length. A count (N_r) of the number of
completed rulers of length r was determined. A length L_r = N_r · r + C_r was recorded.

This whole process was repeated with different sizes of ruler. Normalisation was not
applied. The Richardson Plot was constructed for each model - this represents log L_r
against log r from the above process.

Where calculated, fractal dimension measures were estimated from 1-m, where m
was the slope of a linear regression model fitted over an appropriate range of rulers.
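
The covering procedure and the slope estimate can be sketched in Python as follows. This is not the authors' Borland C++ code: it assumes the end measurement is the straight-line distance from the last ruler position to the final observation, and that the slope is an ordinary least-squares fit of log length against log ruler size. All names are illustrative.

```python
import math

EPS = 1e-9  # tolerance for intersections that fall exactly on a vertex

def ruler_length(xs, ys, r):
    """Measured length of the polyline (xs, ys) for ruler size r:
    (number of completed rulers) * r + remaining straight-line distance."""
    px, py = xs[0], ys[0]          # current ruler position
    k, t = 0, 0.0                  # segment index and parameter of that position
    count = 0
    while True:
        hit = None
        for seg in range(k, len(xs) - 1):
            ax, ay = xs[seg], ys[seg]
            dx, dy = xs[seg + 1] - ax, ys[seg + 1] - ay
            fx, fy = ax - px, ay - py
            a = dx * dx + dy * dy
            b = 2.0 * (fx * dx + fy * dy)
            c = fx * fx + fy * fy - r * r
            disc = b * b - 4.0 * a * c
            if a == 0.0 or disc < 0.0:
                continue                       # the arc does not reach this segment
            lo = t + EPS if seg == k else -EPS  # only look forward along the series
            root = math.sqrt(disc)
            for s in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
                if lo <= s <= 1.0 + EPS:
                    s = min(max(s, 0.0), 1.0)
                    hit = (seg, s, ax + s * dx, ay + s * dy)
                    break
            if hit is not None:
                break
        if hit is None:
            break                              # less than a full ruler remains
        k, t, px, py = hit
        count += 1
    return count * r + math.hypot(xs[-1] - px, ys[-1] - py)

def fractal_dimension(xs, ys, rulers):
    """Estimate 1 - m, where m is the least-squares slope of log L against log r."""
    log_r = [math.log10(r) for r in rulers]
    log_l = [math.log10(ruler_length(xs, ys, r)) for r in rulers]
    mr, ml = sum(log_r) / len(log_r), sum(log_l) / len(log_l)
    m = sum((u - mr) * (v - ml) for u, v in zip(log_r, log_l)) / sum((u - mr) ** 2 for u in log_r)
    return 1.0 - m
```

For a series lying on two connected segments of length 5, for example the points (0, 0), (3, 4), (6, 0), a ruler of size 5 lands exactly on the vertex and measures 10, while rulers of size 4 and 6 both measure less; a straight line gives the same length for every ruler and hence an estimate of 1.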

The covering and other computational procedures were implemented as
algorithms in Borland C++ and the graphics were studied using SPSS Version 9.0.
3. Observations



The Richardson Plots observed generally displayed a decreasing trend from left to right - the
gradient was small initially, then steepening and finally decreasing to zero.

Several causes of variation from this general shape may be highlighted. One
feature, named a gradient effect, is explained by a steep fall in the time series,
whereby an incrementally longer ruler is associated with a significant reduction in
contribution to total length. This effect was particularly prominent in graphs of
random time series with steeply descending links between some points. A significant
reduction in ruled length was also observed when there was a convex pattern in a
sequence of decreasing plotted values. This was named a curvature effect.

Volatility was evident in the profiles of many of the Richardson Plots. One cause
was named the vertex effect. Figure 1 illustrates this for a simple time series where the
observations lie on two connected straight lines of equal length.




Fig. 1: Vertex effect (figure annotation: r_3 = 2*r_1)

In Figure 1, although ruler length r_2 lies between r_1 and r_3 in size, the
corresponding measured time series lengths L_1 and L_3 are both greater than L_2. As L_1
and L_3 cover the series exactly, they correspond to local maxima on the Richardson
Plot. Any other size of ruler which exactly covers the series to the vertex (say the left
half) will also lead to a local maximum - for example r_1/2, r_1/4, etc. Furthermore, if r_2
cuts at equal distance either side of the vertex, L_2 corresponds to a local minimum on
the Richardson Plot. The model exhibited in Figure 2 further illustrates this effect.





Time series: y = 100 - 1.6x for x < 50; y = 26.6 - 0.13x for x >= 50

Fig. 2: Effect of vertex in a time series on corresponding Richardson Plot
The Richardson Plots for observations drawn from a series of exponential
functions with increasingly negative exponents are summarised in Figure 3. The 
progressive appearance and prominence of the local maxima and minima over the 
series can be observed. This can be attributed to the vertex effect. 

In Figures 1-3 the mirror image of each series about the line connecting the start 
and end points will have a corresponding Richardson Plot identical to that displayed. 







Fig. 3: Vertex effect on exponential series 

Where a distinctive summit occurs in the original time series, increasing ruler size
r may lead to a local maximum in the measured length L_r. In Figure 4, a short ruler r_i
provides a total series length L_i. Increasing the size of the ruler causes a decrease in
measured total length initially, but if the summit is pronounced, a longer ruler r_j may
produce a longer L_j, whereas a still longer ruler r_{j+1} leads to a shorter L_{j+1}. We refer to
this as a peak effect, a development of the vertex effect (skewness in the summit as
illustrated will add a curvature effect). Likewise a corresponding trough in the original
time series will also lead to a corresponding local maximum on the Richardson Plot, a
valley effect.




Fig. 4: Peak effect 
In a time series displaying a regular wave pattern (with amplitude small relative
to the time scale) and a linear trend the occurrences and causes of local minima and 
maxima are readily identifiable. In Figure 5, for example, a ruler of length half the 
period of the cycle - or integer multiples of same - will correspond to local minima on 
the Richardson Plot. 





Series shown: y = 8·sin(0.2x); y = 4·sin(0.2x)

Fig. 5: Cyclical time series with linear trend and associated Richardson Plot



Obviously with a time series having many segments of unequal length and 
varying patterns at different scales of enlargement the interpretation of volatility from 
the general shape of the associated Richardson Plot is more complex. Such volatility 
will influence the values of extracted fractal dimension measures such as TFD and 
SFD. 

Further comparative studies were made using an approximation to the Hausdorff 
approach in which instead of placing the ruler along the time series it measures 
horizontally [6, 14]. This method yielded significantly different Richardson Plots 
(practitioners tend to support employing a particular method throughout their related 
analyses and interpret derived fractal dimensions on a comparative basis). Figure 6 
provides an illustrative example for a randomly generated time series. Associated TFD 
and SFD measures will differ in actual and relative order of size (the difference 
between the graphs in values represented on the vertical axes is due to normalisation 
used and does not affect the derived measures). 



4. Future Directions 

Ongoing investigation includes study of the criteria and process for selecting 
boundaries in grouping the rulers, for example in determining TFD and SFD values. 
Some approaches might include identifying boundaries on the basis of maximising the 
linear correlation coefficient for each set of related points, partitioning to create 
maximum difference in gradient between say TFD and SFD, development of a 
guidance/expert system to support for example coping with local maxima and minima,
or automated profile analyser with associated report generator. 





Fig. 6: Comparison of Richardson Plots for random time series (panels: Richardson Plot; Richardson Plot with horizontal rulers)

The approach so far has used linear interpolation in covering the series using a 
particular ruler - the potential of non-linear interpolation merits consideration, 
particularly as a potential approach to reduce the related vertex effect. 

Further current and planned studies include comparison with other related 
methods of analysis, such as rescaled range analysis (Hurst Exponent). Application to 
a continuous series of observations would facilitate comparison with associated 
methods such as wave analysis. Also the potential in convergence studies involving for 
example telecommunications and financial data [3,11,12] is being assessed. 
Abstract. We investigate the application of a wavelet method of lines
solution to financial PDEs. We demonstrate the suitability of a
numerical scheme based on biorthogonal interpolating wavelets for financial
PDE problems where there are discontinuities or regions of sharp
transitions in the solution. The examples treated are the Black Scholes
PDE with discontinuous payoffs and a 3-dimensional cross currency swap
PDE, for which a speedup over standard finite difference methods of two
orders of magnitude is reported.



1 Introduction 

What are wavelets? Wavelets are nonlinear functions which can be scaled and
translated to form a basis for the Hilbert space $L^2(\mathbb{R})$ of square integrable
functions. Thus wavelets generalize the trigonometric functions given by $e^{isx}$ ($s \in \mathbb{R}$),
which generate the classical Fourier basis for $L^2(\mathbb{R})$. It is therefore not surprising that
wavelet and fast wavelet transforms exist which generalize the time to frequency 
map of the Fourier transform to pick up both the space and time behaviour 
of a function [11]. Wavelets have been used in the field of image compression 
and image analysis for quite some time. Indeed the main motivation behind the 
development of wavelets was the search for fast algorithms to compute com- 
pact representations of functions and data sets based on exploiting structure in 
the underlying functions. In the solution of PDE’s using wavelets [1, 4, 5, 18, 19]
functions and operators are expanded in a wavelet basis to allow a combination
of the desirable features of finite-difference methods, spectral methods and
front-tracking or adaptive grid approaches. The advantages of using wavelets to
solve PDE’s that arise in finance are that large classes of operators and functions
which occur in this area are sparse, or sparse to some high accuracy, when
transformed into the wavelet domain. Wavelets are also suitable for problems with
multiple spatial scales (which occur frequently in financial problems) since they
give an accurate representation of the solution in regions of sharp transitions
and combine the advantages of both spectral and finite-difference methods.

In this paper we implement a wavelet method of lines scheme using biorthog- 
onal wavelets to solve the Black Scholes PDE for option values with discontin- 
uous payoff structures and a 3-dimensional cross currency swap PDE based on 
extended Vasicek interest rate models. We demonstrate numerically the advantages
of using a wavelet based PDE method in solving these kinds of problems.
The paper is organized as follows. In Section 2 we give a brief introduction to 
wavelet theory. In Sections 3 and 4 we give an explanation of wavelet based PDE 
methods and explain the biorthogonal wavelet approach in more detail. Sections 
5 and 6 contain respectively the problems and numerical results for the Black 
Scholes and cross currency swap PDEs and Section 7 concludes and describes 
research in progress. 

2 Basic Wavelet Theory 

We now give a brief introduction to wavelets for real valued functions of a real 
argument. Further detail can be found in the cited references and [6] and we shall 
extend the concepts needed for this paper to higher dimensions in the sequel. 

Daubechies based wavelets 

Consider two functions: the scaling function $\phi$ and the wavelet function $\psi$.
The scaling function is the solution of a dilation equation

$$\phi(x) = \sqrt{2} \sum_{k=0}^{\infty} h_k\, \phi(2x - k), \qquad (1)$$

where $\phi$ is normalised so that $\int_{-\infty}^{\infty} \phi(x)\,dx = 1$ and the wavelet function is defined
in terms of the scaling function as

$$\psi(x) = \sqrt{2} \sum_{k=0}^{\infty} g_k\, \phi(2x - k). \qquad (2)$$

We can build up an orthonormal basis for the Hilbert space $L^2(\mathbb{R})$ of (equivalence
classes of) square integrable functions from the functions $\phi$ and $\psi$ by dilating
and translating them to obtain the basis functions:

$$\phi_{j,k}(x) = 2^{-j/2}\, \phi(2^{-j}x - k), \qquad (3)$$

$$\psi_{j,k}(x) = 2^{-j/2}\, \psi(2^{-j}x - k). \qquad (4)$$

In the above equations $j$ is the dilation or scaling parameter and $k$ is the
translation parameter. All wavelet properties are specified through the coefficients
$H := \{h_k\}_{k=0}^{\infty}$ and $G := \{g_k\}_{k=0}^{\infty}$, which are chosen so that dilations and
translations of the wavelet $\psi$ form an orthonormal basis of $L^2(\mathbb{R})$, i.e.

$$\int_{-\infty}^{\infty} \psi_{j,k}(x)\, \psi_{l,m}(x)\,dx = \delta_{jl}\,\delta_{km}, \qquad j,k,l,m \in \mathbb{Z}_+, \qquad (5)$$

where $\mathbb{Z}_+ := \{0, 1, 2, \ldots\}$ and $\delta_{jl}$ is the Kronecker delta function.

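Under these conditions for any function $f \in L^2(\mathbb{R})$ there exists a set $\{d_{jk}\}$ such
that

$$f(x) = \sum_{j \in \mathbb{Z}_+} \sum_{k \in \mathbb{Z}_+} d_{jk}\, \psi_{j,k}(x), \qquad (6)$$

where

$$d_{jk} := \int_{-\infty}^{\infty} f(x)\, \psi_{j,k}(x)\,dx. \qquad (7)$$

It is usual to denote the spaces spanned by $\phi_{j,k}$ and $\psi_{j,k}$ over the parameter $k$,
with $j$ fixed, by

$$V_j := \operatorname{span}_{k \in \mathbb{Z}_+} \phi_{j,k}, \qquad W_j := \operatorname{span}_{k \in \mathbb{Z}_+} \psi_{j,k}. \qquad (8)$$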



In the expansion (6) functions with arbitrary small scales can be represented, 
however in practice there is a limit on how small the smallest structure can be. 
(This could for example be dependent on a required grid size in a numerical
computation as we shall see below.) To implement wavelet analysis on a computer,
we need to have a bounded range and domain to generate approximations to
functions $f \in L^2(\mathbb{R})$ and thus must limit $H$ and $G$ to finite sets termed filters.
Approximation accuracy is specified by requiring that the wavelet function $\psi$
satisfies

$$\int_{-\infty}^{\infty} \psi(x)\, x^m\,dx = 0 \qquad (9)$$

for $m = 0, \ldots, M-1$, which implies exact approximation for polynomials of
degree $M-1$. For Daubechies wavelets [6] the number of coefficients or the length
$L$ of the filters $H$ and $G$ is related to the number of vanishing moments $M$ in (9)
by $2M = L$. In addition elements of $H$ and $G$ are related by $g_k = (-1)^k h_{L-k-1}$
for $k = 0, \ldots, L-1$ and the two finite sets of coefficients $H$ and $G$ are known in
the signal processing literature as quadrature mirror filters. The coefficients $H$
needed to define compactly supported wavelets with high degrees of regularity
can be derived [6] and the usual notation to denote a Daubechies based wavelet
defined by coefficients $H$ of length $L$ is $D_L$. Therefore on a computer an
approximation subspace expansion would be in the form of a finite direct sum of finite
dimensional vector spaces as

$$V_J = W_0 \oplus W_1 \oplus W_2 \oplus \cdots \oplus W_{J-1} \oplus V_0,$$

and the corresponding orthogonal wavelet series approximation to a continuous
function $f$ on a compact domain is given by

$$f(x) \approx \sum_k d_{0,k}\, \psi_{0,k}(x) + \cdots + \sum_k d_{J-1,k}\, \psi_{J-1,k}(x) + \sum_k s_{0,k}\, \phi_{0,k}(x), \qquad (10)$$

where $J$ is the number of multiresolution components (or scales) and $k$ ranges
from 0 to the number of coefficients in the specified component. The spaces $W_j$
and $V_j$ are termed scaling function and approximation subspaces respectively.
The coefficients $d_{0,k}, \ldots, d_{J-1,k}, s_{0,k}$ are termed the wavelet transform
coefficients and the functions $\phi_{j,k}$ and $\psi_{j,k}$ are the approximating wavelet functions.
Some examples of basic wavelets are the Haar wavelet which is just a square
wave (the indicator function of the unit interval), the Daubechies wavelets [6]
and Coiflet wavelets [2].
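
As a small illustration of the filter pair $H$, $G$ described above, the following sketch uses the Haar filters, the shortest case ($L = 2$, $M = 1$). The function names `analyse` and `synthesise` are hypothetical, and the quadrature mirror relation is taken in the standard form $g_k = (-1)^k h_{L-1-k}$, which is an assumption about the convention used here. It shows one level of decomposition and exact reconstruction, and that the single vanishing moment annihilates locally constant data.

```python
import math

# Haar filters: the shortest Daubechies pair (L = 2, M = 1 vanishing moment).
h = [1 / math.sqrt(2), 1 / math.sqrt(2)]
L = len(h)
# Quadrature mirror relation g_k = (-1)^k h_{L-1-k} (assumed convention).
g = [(-1) ** k * h[L - 1 - k] for k in range(L)]


def analyse(s):
    """One level of the orthogonal transform: returns (coarse, detail)."""
    half = len(s) // 2
    coarse = [sum(h[i] * s[2 * k + i] for i in range(L)) for k in range(half)]
    detail = [sum(g[i] * s[2 * k + i] for i in range(L)) for k in range(half)]
    return coarse, detail


def synthesise(coarse, detail):
    """Invert analyse(): rebuild the fine-level samples."""
    out = [0.0] * (2 * len(coarse))
    for k in range(len(coarse)):
        for i in range(L):
            out[2 * k + i] += h[i] * coarse[k] + g[i] * detail[k]
    return out


signal = [4.0, 4.0, 4.0, 4.0, 1.0, 3.0, 7.0, 7.0]
coarse, detail = analyse(signal)
rebuilt = synthesise(coarse, detail)
```

Only the pair `(1.0, 3.0)` produces a non-zero detail coefficient; the constant pairs are represented by the coarse coefficients alone, which is the sparsity property the later sections rely on.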

Biorthogonal wavelets 

Biorthogonal wavelets are a generalization of orthogonal wavelets first
introduced by Cohen, Daubechies and Feauveau [3]. Biorthogonal wavelets are
symmetric and do not introduce phase shifts in the coefficients. In biorthogonal
wavelet analysis we have four basic function types $\phi$, $\psi$, $\tilde\phi$ and $\tilde\psi$. The functions
$\psi$ and $\phi$ are termed mother and father wavelets and the functions $\tilde\psi$ and $\tilde\phi$
are the dual wavelets. The father and mother wavelets are used to compute the
wavelet coefficients as in the orthogonal case, but now the biorthogonal wavelet
approximation of a continuous function on a compact domain is expressed in
terms of the dual wavelet functions as

$$f(x) \approx \sum_k d_{0,k}\, \tilde\psi_{0,k}(x) + \cdots + \sum_k d_{J-1,k}\, \tilde\psi_{J-1,k}(x) + \sum_k s_{0,k}\, \tilde\phi_{0,k}(x). \qquad (11)$$

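In signal processing $\phi$ and $\psi$ are used to analyze the signal and $\tilde\phi$ and $\tilde\psi$ are
used to synthesize the signal. In general biorthogonal wavelets are not mutually
orthogonal, but they do satisfy biorthogonal relationships of the form

$$\int \phi_{j,k}(x)\, \tilde\phi_{j',k'}(x)\,dx = \delta_{jj'}\,\delta_{kk'},$$
$$\int \phi_{j,k}(x)\, \tilde\psi_{j',k'}(x)\,dx = 0,$$
$$\int \psi_{j,k}(x)\, \tilde\phi_{j',k'}(x)\,dx = 0,$$
$$\int \psi_{j,k}(x)\, \tilde\psi_{j',k'}(x)\,dx = \delta_{jj'}\,\delta_{kk'}, \qquad (12)$$

where $j, j', k, k'$ range over appropriate finite sets of integers.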

3 Wavelets and PDE’s 

Wavelet based approaches to the solution of PDE’s have been presented by Xu 
and Shann [21], Beylkin [1], Vasilyev et al [18,19], Prosser and Cant [14], Dahmen 
et al [5] and Cohen et al [4]. There are two main approaches to the numerical 
solution of PDEs using wavelets. Consider the most general form for a system 
of parabolic PDEs given by 
which describe the time evolution of a vector valued function $u$ and the boundary
conditions are possibly algebraic or differential constraints. The wavelet-Galerkin
method assumes that the wavelet coefficients are functions of time. An
appropriate wavelet decomposition for each component of the solution is substituted
into (13) and a Galerkin projection is used to derive a nonlinear system of
ordinary differential-algebraic equations which describe the time evolution of the
wavelet coefficients. In a wavelet-collocation method (13) is evaluated at
collocation points of the domain of $u$ and a system of nonlinear ordinary
differential-algebraic equations describing the evolution of the solution at these collocation
points is obtained.

If we want the numerical algorithm to be able to resolve all structures appearing 
in the solution and also to be efficient in terms of minimising the number of 
unknowns, the basis of active wavelets and consequently the computational grid 
for the wavelet-collocation algorithm should adapt dynamically in time to reflect 
local changes in the solution. This adaptation of the wavelet basis or computa- 
tional grid is based on analysis of the wavelet coefficients. The contribution of a 
particular wavelet to the approximation is significant if and only if the nearby 
structures of the solution have a size comparable with the wavelet scale. Thus 
using a thresholding technique a large number of the fine scale wavelets may 
be dropped in regions where the solution is smooth. In the wavelet-collocation 
method every wavelet is uniquely associated with a collocation point. Hence a 
collocation point can be omitted from the grid if the associated wavelet is omitted 
from the approximation. This property of the multilevel wavelet approximation 
allows local grid refinement up to a prescribed small scale without a drastic in- 
crease in the number of collocation points. A fast adaptive wavelet collocation 
algorithm for two dimensional PDE’s is presented in [18] and a spatial discretiza- 
tion scheme using bi-orthogonal wavelets is implemented in [12-14]. The wavelet 
scheme is used in the latter to solve the reacting Navier-Stokes equations and 
the main advantage of the approach is that when the solution is computed in 
wavelet space it is possible to exploit sparsity in order to reduce storage costs 
and speed up solution times. We will now explain the wavelet-collocation method
in greater detail. 

The biorthogonal wavelet approach 

The main difference in using biorthogonal systems is that we have both pri- 
mal and dual basis functions derived from primal and dual scaling and wavelet 
functions. Biorthogonal wavelet systems are derived from a paired hierarchy of 
approximation subspaces 



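$$\cdots \subset V_{j-1} \subset V_j \subset V_{j+1} \subset \cdots, \qquad \cdots \subset \tilde V_{j-1} \subset \tilde V_j \subset \tilde V_{j+1} \subset \cdots.$$

(Note that here increasing $j$ denotes refinement of the grid, although some
authors in the wavelet literature use an increasing scale index $j$ to indicate its
coarsening.) For periodic discretizations $\dim(V_j) = 2^j$. The basis functions for
these spaces are the primal scaling function $\phi$ and the dual scaling function $\tilde\phi$.
Define two innovation spaces $W_j$ and $\tilde W_j$ such that

$$V_{j+1} := V_j \oplus W_j, \qquad \tilde V_{j+1} := \tilde V_j \oplus \tilde W_j, \qquad (14)$$

where $\tilde V_j \perp W_j$ and $V_j \perp \tilde W_j$. The innovation spaces so defined satisfy

$$\bigoplus_{j=0}^{\infty} W_j = L^2(\mathbb{R}) = \bigoplus_{j=0}^{\infty} \tilde W_j \qquad (15)$$

and the innovation space basis functions are $\psi$ and $\tilde\psi$.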

Biorthogonal Interpolating wavelet transform

The biorthogonal interpolating wavelet transform [8, 9] has basis functions

$$\phi_{j,k}(x) = \phi(2^j x - k),$$
$$\psi_{j,k}(x) = \phi(2^{j+1} x - 2k - 1),$$
$$\tilde\phi_{j,k}(x) = \delta(x - x_{j,k}), \qquad (16)$$

where $\delta(\cdot)$ is the Dirac delta function. The wavelets are said to be interpolating
because the primal scaling function $\phi$ to which they are related satisfies

$$\phi(k) = \begin{cases} 1 & k = 0, \\ 0 & k \neq 0. \end{cases}$$



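The projection of a function $f$ onto a space of scaling functions $V_j$ is given
(discretizing $[0, 1]$) by

$$P_{V_j} f(x) = \sum_{k=0}^{2^j} s_{j,k}\, \phi^{\square}_{j,k}(x), \qquad (17)$$

where $s_{j,k}$ is defined as $\langle f, \tilde\phi_{j,k} \rangle$ in terms of a suitable inner product and $\phi^{\square}_{j,k}$
is used to denote a boundary or internal wavelet given by

$$\phi^{\square}_{j,k}(x) = \begin{cases} \phi^{L}_{j,k}(x) & k = 0, \ldots, M-1, \\ \phi_{j,k}(x) & k = M, \ldots, 2^j - M, \\ \phi^{R}_{j,k}(x) & k = 2^j - M + 1, \ldots, 2^j. \end{cases} \qquad (18)$$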


Fast biorthogonal wavelet transform algorithm 

The projection of a function $f$ onto a finite dimensional scaling function space
$V_j$ is given as above by

$$P_{V_j} f(x) = \sum_k \langle f, \tilde\phi_{j,k} \rangle\, \phi_{j,k}(x) = \sum_k f(k/2^j)\, \phi_{j,k}(x) = \sum_k s_{j,k}\, \phi_{j,k}(x), \qquad (19)$$

where $s_{j,k} = f(k/2^j)$. The coefficients at resolution level $j$ must be derived using

$$P_{W_j} f(x) = P_{V_{j+1}} f(x) - P_{V_j} f(x) = \sum_m s_{j+1,m}\, \phi_{j+1,m}(x) - \sum_n s_{j,n}\, \phi_{j,n}(x). \qquad (20)$$

An arbitrary wavelet coefficient $d_{j,m}$ can be calculated from

$$d_{j,m} = s_{j+1,2m+1} - \sum_n s_{j,n}\, \phi(m - n + 1/2) = s_{j+1,2m+1} - \sum_n \Gamma_{mn}\, s_{j,n}, \qquad (21)$$

where $\Gamma$ is a square matrix of size $2^j \times 2^j$ for periodic discretizations defined by

$$\Gamma_{mn} := \phi(m - n + 1/2).$$

Because of the compact support of the primal scaling function this matrix has
a band diagonal structure. The primal scaling function can be defined through
the use of the two scale relation

$$\phi(x) = \sum_{\xi \in \Xi} \phi(\xi/2)\, \phi(2x - \xi), \qquad (22)$$

where $\Xi$ is a suitable subset of $\mathbb{Z}$. The smoothness of the primal scaling function
is dictated by its $(M-1)$th degree polynomial span which in turn depends on
the $M+1$ non-zero values of $\phi(\xi/2)$. The values $\phi(\xi/2)$ can be calculated using
an explicit relation as in [13]. Fast transform methods for the evaluation of the
wavelet and scaling function coefficients are given in [15,17].
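
The coefficient formula above can be sketched for the simplest interpolating scaling function. The example below assumes the piecewise-linear hat function, for which the only non-zero entries of $\Gamma$ are $\phi(\pm 1/2) = 1/2$; this choice, the periodic wrap-around and the names `forward` and `inverse` are illustrative assumptions, not the higher order scheme of the paper. Each wavelet coefficient is then the odd-indexed fine sample minus the average of its two coarse neighbours.

```python
import math


def forward(samples, levels):
    """Interpolating wavelet transform of periodic samples s_{J,k} = f(k / 2**J).

    Returns (coarse, details); details[0] belongs to the finest level.
    """
    s = list(samples)
    details = []
    for _ in range(levels):
        n = len(s) // 2
        coarse = [s[2 * m] for m in range(n)]  # subsampling: s_{j,m} = s_{j+1,2m}
        # d_{j,m} = s_{j+1,2m+1} - sum_n Gamma_{mn} s_{j,n}, with the hat
        # function giving Gamma entries phi(1/2) = phi(-1/2) = 1/2 only.
        d = [s[2 * m + 1] - 0.5 * (coarse[m] + coarse[(m + 1) % n]) for m in range(n)]
        details.append(d)
        s = coarse
    return s, details


def inverse(coarse, details):
    """Rebuild the finest-level samples from the transform vector."""
    s = list(coarse)
    for d in reversed(details):
        n = len(s)
        fine = [0.0] * (2 * n)
        for m in range(n):
            fine[2 * m] = s[m]
            fine[2 * m + 1] = d[m] + 0.5 * (s[m] + s[(m + 1) % n])
        s = fine
    return s


J = 6
N = 2 ** J
# A narrow bump: sharp structure near x = 0.5, essentially flat elsewhere.
samples = [math.exp(-200.0 * (k / N - 0.5) ** 2) for k in range(N)]
coarse, details = forward(samples, levels=3)
rebuilt = inverse(coarse, details)
# Count fine-scale coefficients that a threshold of 1e-3 would discard.
small = sum(1 for d in details[0] if abs(d) < 1e-3)
```

Away from the bump the fine-scale coefficients are negligible, so most of them could be dropped by thresholding; each level costs a fixed number of operations per sample, in line with the linear complexity discussed in the text.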

Irrespective of the choice of primal scaling function the transform vector that
arises from the wavelet transform will have a structure of the form given below:

$$\big(\underbrace{d_{J-1,0}, \ldots, d_{J-1,2^{J-1}-1}}_{W_{J-1}},\ \underbrace{d_{J-2,0}, \ldots, d_{J-2,2^{J-2}-1}}_{W_{J-2}},\ \ldots,\ \underbrace{d_{J-P,0}, \ldots, d_{J-P,2^{J-P}-1}}_{W_{J-P}},\ \underbrace{s_{J-P,0}, \ldots, s_{J-P,2^{J-P}-1}}_{V_{J-P}}\big)^{T} \qquad (23)$$

involving $P$ resolution levels and a finest discretization of $2^J$.

Algorithm complexity 

The number of floating point operations required for the fast biorthogonal
wavelet transform algorithm for $P$ resolution levels is
$2^{J-P+1} M \{2^{P} - 1\}$. This comes from the fact that we require $2M$ filter
coefficients to define the primal scaling function $\phi$ which spans the space of
polynomials of degree less than $M - 1$. The calculation of each of the $2^j$ wavelet
coefficients $d_{j,k}$ for a given resolution $j$ can be accomplished in $2(M-1)+1$ floating point
operations. The sub sampling process for the scaling function coefficients
requires a further $2^j$ operations and a total of $2^{j+1}M$ operations are required
per resolution $j$. Thus for fixed $J$ and $P$ the complexity of the fast interpolating
wavelet transform algorithm is $O(M)$ [15]. Since the finest resolution in a PDE
spatial grid of $N$ points is $J = \log_2 N$, for fixed $M$ and $P$ the complexity of the
transform is $O(N)$.
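
The total can be checked by summing the per-resolution cost over the $P$ levels. The short derivation below is a sketch that assumes the levels run from $j = J-P$ to $j = J-1$, as in the operator decomposition of the next subsection, and uses only the per-level count of $2^{j+1}M$ operations stated above.

$$\sum_{j=J-P}^{J-1} 2^{j+1} M = M\left(2^{J+1} - 2^{J-P+1}\right) = 2^{J-P+1} M \left(2^{P} - 1\right) < 2MN, \qquad N = 2^{J}.$$

For instance, with $J = 10$, $P = 3$ and $M = 2$ the sum is $2(256 + 512 + 1024) = 3584 = 2^{8} \cdot 2 \cdot 7$, below the bound $2MN = 4096$, so the work grows linearly with the number of grid points.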

Decomposition of differential operators 

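If we define $\partial_J^{(n)}$ by

$$\partial_J^{(n)} f(x) := P_{V_J} \frac{d^n}{dx^n} P_{V_J} f(x),$$

then repeated application of the approximation subspace decomposition gives us

$$\partial_J^{(n)} f(x) = \Big(P_{V_{J-P}} + \sum_{j=J-P}^{J-1} P_{W_j}\Big) \frac{d^n}{dx^n} \Big(P_{V_{J-P}} + \sum_{j=J-P}^{J-1} P_{W_j}\Big) f(x). \qquad (24)$$

For example, the decomposition of the first derivative operator $\frac{d}{dx}$ is given by

$$\bar\partial_J^{1} = \Big(P_{V_{J-P}} + \sum_{j=J-P}^{J-1} P_{W_j}\Big) \frac{d}{dx} \Big(P_{V_{J-P}} + \sum_{j=J-P}^{J-1} P_{W_j}\Big), \qquad (25)$$

where $\bar\partial_J^{1} := W \partial_J^{1} W^{-1}$ and $W$ and $W^{-1}$ are matrices denoting the forward and
inverse transforms with

$$\partial_J^{1} := P_{V_J} \frac{d}{dx} P_{V_J}. \qquad (26)$$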

We can analyze $\partial_J^{1}$ instead of $\bar\partial_J^{1}$ without loss of generality because the forward
and inverse transforms are exact up to machine precision. The matrix $\partial_J^{1}$ has
a band diagonal structure and can be treated as a finite difference scheme for
analysis. The biorthogonal expansion for $\frac{d}{dx}$ requires information on the
interaction between the differentiated and undifferentiated scaling functions along with
information about both the primal and dual basis functions. Using the sampling
nature of the dual scaling function $\partial_J^{1}$ can be written as

$$\partial_J^{1} f(x) = \sum_k \Big[ 2^J \sum_{\alpha} s_{J,\alpha}\, \frac{d\phi(x)}{dx}\Big|_{x=\alpha-k} \Big] \phi_{J,k}(x) \qquad (27)$$



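and using equation (18) we get

$$\partial_J^{1} f(x) = \begin{cases}
2^J \sum_{\alpha,k} s_{J,\alpha}\, \frac{d\phi^{L}_{J,\alpha}(x)}{dx}\Big|_{x=\alpha-k}\, \phi^{L}_{J,k}(x) & k = 0, \ldots, M-1, \\
2^J \sum_{\alpha,k} s_{J,\alpha}\, \frac{d\phi(x)}{dx}\Big|_{x=\alpha-k}\, \phi_{J,k}(x) & k = M, \ldots, 2^J - M, \\
2^J \sum_{\alpha,k} s_{J,\alpha}\, \frac{d\phi^{R}_{J,\alpha}(x)}{dx}\Big|_{x=\alpha-k}\, \phi^{R}_{J,k}(x) & k = 2^J - M + 1, \ldots, 2^J.
\end{cases} \qquad (28)$$

The entire operator $\partial_J^{1}$ can be determined provided the values of $\frac{d\phi(x)}{dx}\big|_{x=\alpha-k}$
can be obtained. An approach to determining filter coefficients for higher order
derivatives is given in [13].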

Extension to multiple dimensions 

The entire wavelet multiresolution framework presented so far can be extended
to several spatial dimensions by taking straightforward tensor products of the
appropriate 1D wavelet bases. The imposition of boundary conditions on
nonlinearly bounded domains is nontrivial, but these are fortunately rare in derivative
valuation PDE problems which are usually Cauchy problems on a strip.

The fast biorthogonal interpolating wavelet transform used with wavelet
collocation methods for problems posed over $d$-dimensional domains exhibits better
complexity than its alternatives. Indeed, since one basis function is needed for
each collocation point, using a spatial grid of $n$ points in each dimension there
are $N := n^d$ points in the spatial domain to result in transform complexity
$O(n^d)$ - versus $O(n^d \log_2 n)$ for the Fast Fourier Transform (where applicable),
for an explicit finite difference scheme and 0{n^‘^) for a Crank- Nicholson 
or implicit scheme (which makes these methods impractical for d > 2, cf. [7]). 

4 Wavelet Method of Lines

In a traditional finite difference scheme partial derivatives are replaced with 
algebraic approximations at grid points and the resulting system of algebraic 
equations is solved to obtain the numerical solution of the PDE. In the wavelet 
method of lines we transform the PDE into a vector system of ODEs by replacing 
the spatial derivatives with their wavelet transform approximations but retain 
the time derivatives. We then solve this vector system of ODEs using a suit- 
able stiff ODE solver. We have implemented both a fourth order Runge Kutta 
method and a method based on the backward differentiation formula (LSODE) 
developed at the Lawrence Livermore Laboratories [16]. The fundamental
complexity of this method is $O(\tau n^d)$ for space and time discretizations of size $n$ and
$\tau$ respectively over domains of dimension $d$ (cf. §3, [16]).
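
The method of lines idea can be sketched on the heat equation $u_t = u_{xx}$ on $[0,1]$ with zero boundary values. In this illustration a central-difference operator stands in for the wavelet derivative approximation, and a fourth order Runge Kutta step is used rather than a stiff solver such as LSODE; both are simplifying assumptions, and the names `second_derivative`, `rk4_step` and `solve_heat` are hypothetical.

```python
import math


def second_derivative(u, h):
    """Central-difference stand-in for the spatial operator (Dirichlet ends)."""
    du = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        du[i] = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h)
    return du  # end points keep du/dt = 0, so boundary values stay fixed


def rk4_step(u, dt, rhs):
    """One classical fourth order Runge Kutta step for du/dt = rhs(u)."""
    k1 = rhs(u)
    k2 = rhs([a + 0.5 * dt * b for a, b in zip(u, k1)])
    k3 = rhs([a + 0.5 * dt * b for a, b in zip(u, k2)])
    k4 = rhs([a + dt * b for a, b in zip(u, k3)])
    return [a + dt / 6.0 * (p + 2 * q + 2 * r + s)
            for a, p, q, r, s in zip(u, k1, k2, k3, k4)]


def solve_heat(n=32, steps=400, t_end=0.1):
    """Semi-discretise in space, then integrate the ODE system in time."""
    h = 1.0 / n
    u = [math.sin(math.pi * i * h) for i in range(n + 1)]
    u[-1] = 0.0  # pin the right boundary exactly (sin(pi) is only ~1e-16)
    dt = t_end / steps
    for _ in range(steps):
        u = rk4_step(u, dt, lambda v: second_derivative(v, h))
    return u


u = solve_heat()
# Exact solution at the mid-point: exp(-pi^2 t) * sin(pi x) with x = 0.5.
exact_mid = math.exp(-math.pi ** 2 * 0.1) * math.sin(math.pi * 0.5)
```

Each time step touches every grid point a fixed number of times, so the work is proportional to the number of time steps times the number of spatial points, which is the $O(\tau n^d)$ behaviour quoted above for $d = 1$. The explicit integrator needs a time step of order $h^2$ for stability, which is one reason a stiff solver is attractive in practice.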

An example 

Consider a first order nonlinear hyperbolic transport PDE defined over an
interval $\Omega = [x_l, x_r]$:

du R / \ 

= -X (t) X = Xr. 

The numerical scheme is applied to the wavelet transformed counterpart of the 
above equations 

= -d^pu + X i dn, 

where pjZp ■= (-Pvj-p + Avi) and is the standard decompo- 

sition of -p defined as PjZ\>p:pjZ\,. In using the multiresolution strategy to 
discretize the problem we represent the domain $P+1$ times, where $P$ is the
number of different resolutions in the discretization, because of the $P$ wavelet spaces
and the coarse resolution scaling function space $V_{J-P}$, $P > 1$. In the transform
domain each representation of the solution defined at some resolution $p$ should be
supplemented by boundary conditions and [15] shows how to impose boundary
conditions in both the scaling function spaces and the wavelet spaces.



5 Financial Derivative Valuation PDEs



In this section we introduce briefly the PDEs for financial derivative valuation 
and the products we have valued using the wavelet method of lines described 
above. More details may be found in [7, 20] . 

Black Scholes products 

We have applied wavelet methods to solve the Black Scholes PDE for a vanilla
European call option and two binary options. The Black Scholes quasilinear
parabolic PDE is given by

$$\frac{\partial C}{\partial t} + \frac{1}{2}\sigma^2 S^2 \frac{\partial^2 C}{\partial S^2} + rS\frac{\partial C}{\partial S} - rC = 0, \qquad (29)$$

where $S$ is the stock price, $\sigma$ is volatility, $r$ is the risk free rate of interest. We
transform (29) to the heat diffusion equation

$$\frac{\partial u}{\partial \tau} = \frac{\partial^2 u}{\partial x^2} \qquad \text{for } -\infty < x < \infty,\ \tau > 0$$

with the transformations

$$S := Ke^{x}, \qquad t := T - \frac{2\tau}{\sigma^2},$$

where $k = 2r/\sigma^2$, $K$ is the exercise price and $T$ is the time to maturity of the
option to be valued. The boundary conditions for the PDE depend on the specific
type of option. For a vanilla European call option the boundary conditions are:

$$C(0,t) = 0, \qquad C(S,t) \sim S \text{ as } S \to \infty,$$

$$C(S,T) = \max(S - K, 0).$$

The boundary conditions for the transformed PDE are:

$$u(x,\tau) = 0 \text{ as } x \to -\infty,$$
$$u(x,\tau) \sim e^{\frac{1}{2}(k+1)x + \frac{1}{4}(k+1)^2\tau} \text{ as } x \to \infty,$$
$$u(x,0) = \max\left(e^{\frac{1}{2}(k+1)x} - e^{\frac{1}{2}(k-1)x},\ 0\right).$$

The first type of binary option that we solved was the cash-or-nothing call option
with a payoff given by

$$\Pi(S) = B\,\mathcal{H}(S - K),$$

where $\mathcal{H}$ is the Heaviside function, i.e. the payoff is $B$ if at expiry the stock price
$S > K$. The boundary conditions for this option in the transformed domain are

$$u(x,\tau) = 0 \text{ as } x \to -\infty,$$
$$u(x,\tau) = \frac{B}{K}\, e^{\frac{1}{2}(k-1)x + \frac{1}{4}(k-1)^2\tau} \text{ as } x \to \infty,$$
$$u(x,0) = \frac{B}{K}\, e^{\frac{1}{2}(k-1)x}\, \mathcal{H}(Ke^{x} - K).$$

The second binary option we solved was a supershare call [20] option that pays
an amount $1/d$ if the stock price lies between $K$ and $K + d$ at expiry. Its payoff
is thus

$$\Pi(S) = \frac{1}{d}\left(\mathcal{H}(S - K) - \mathcal{H}(S - K - d)\right),$$

which becomes a delta function in the limit $d \to 0$. The initial boundary
condition for this option is

$$u(x,0) = \frac{1}{dK}\, e^{\frac{1}{2}(k-1)x}\left(\mathcal{H}(Ke^{x} - K) - \mathcal{H}(Ke^{x} - K - d)\right).$$

For all of the above options the solution is transformed back to real variables
using the transformation

$$C(S,t) = K^{\frac{1}{2}(k+1)}\, S^{\frac{1}{2}(1-k)}\, e^{-\frac{1}{8}(k+1)^2 \sigma^2 (T-t)}\, u\big(\log(S/K),\ \tfrac{1}{2}\sigma^2 (T-t)\big),$$

where $k = 2r/\sigma^2$. There are closed form solutions for all the above options (see
for example [20]). The Black Scholes solution for the vanilla European call option
is

$$C(S,t) = S N(d_1) - K e^{-r(T-t)} N(d_2),$$

$$d_1 := \frac{\log(S/K) + (r + \frac{1}{2}\sigma^2)(T-t)}{\sigma\sqrt{T-t}},$$

$$d_2 := \frac{\log(S/K) + (r - \frac{1}{2}\sigma^2)(T-t)}{\sigma\sqrt{T-t}}.$$
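
The closed form for the vanilla call is easy to evaluate and gives a reference value for the numerical schemes. The sketch below uses the parameters of the European call test case of Section 6 (stock price 10, strike 10, interest rate 5%, volatility 20%, one year to maturity); the function names `norm_cdf` and `bs_call` are hypothetical, and the normal distribution function is computed from the standard library error function.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function N(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(S, K, r, sigma, tau):
    """European call value; tau = T - t is the remaining time to maturity."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)  # same as the d_2 expression in the text
    return S * norm_cdf(d1) - K * math.exp(-r * tau) * norm_cdf(d2)


price = bs_call(S=10.0, K=10.0, r=0.05, sigma=0.20, tau=1.0)
```

The computed `price` agrees with the exact value 1.04505 quoted in Section 6 to within rounding in the last printed digit.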



The solution for a cash or nothing call is 






The solution for the supershare option is 







Cross currency swap products 

A cross currency swap is a derivative contract between two counterparties to 
exchange cash flows in their respective domestic currencies. Such contracts are 
an increasing share of the global swap markets and are individually structured 
products with many complex valuations. With two economies, i.e one domestic 
and one foreign, there are different term structure processes and risk preferences 
in each economy and a rate of currency exchange between them. We will model 
the interest rates in a single factor extended Vasicek framework.

To value any European-style derivative security whose payoff is a measurable
function with respect to a filtration $\mathcal{F}_T$ we may derive a PDE for its value. The
domestic and foreign bond prices and exchange rate are specified in terms of the
driftless Gaussian state variables $X_d$, $X_f$ and $X_S$ whose corresponding processes
$\mathbf{X}_d$, $\mathbf{X}_f$ and $\mathbf{X}_S$ are sufficient statistics for movements in the term structure
dynamics. Let $V = V(X_d, X_f, X_S, t)$ be the domestic value function of a security
with a terminal payoff measurable with respect to $\mathcal{F}_T$ and no intermediate
payments, and assume that $V \in C^{2,1}(\mathbb{R}^3 \times [0,T))$. Then the normalised domestic
value process, defined by




( 30 ) 
satisfies the quasilinear parabolic PDE with time dependent coefficients given
by

at 2 dXj 2 f 2 ax| dXidXf dX^dXg dXjdXs 

(31) 

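on $\mathbb{R}^3 \times [0,T)$. Here the functions $H^{SS}$, $H^{df}$, $H^{dS}$ and $H^{fS}$ are defined by

$$H^{SS}(s) := G_d^2(s)\lambda_d^2(s) + G_f^2(s)\lambda_f^2(s) + \sigma_S^2(s) - 2\rho_{df}(s)G_d(s)\lambda_d(s)G_f(s)\lambda_f(s) + 2\rho_{dS}(s)G_d(s)\lambda_d(s)\sigma_S(s) - 2\rho_{fS}(s)G_f(s)\lambda_f(s)\sigma_S(s),$$
$$H^{df}(s) := \rho_{df}(s)\lambda_d(s)\lambda_f(s),$$
$$H^{dS}(s) := \lambda_d(s)\left[G_d(s)\lambda_d(s) - \rho_{df}(s)G_f(s)\lambda_f(s) + \rho_{dS}(s)\sigma_S(s)\right],$$
$$H^{fS}(s) := \lambda_f(s)\left[\rho_{df}(s)G_d(s)\lambda_d(s) - G_f(s)\lambda_f(s) + \rho_{fS}(s)\sigma_S(s)\right] \qquad (32)$$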

and the volatility is of the form

$$\sigma_k(t,T) = [G_k(T) - G_k(t)]\,\lambda_k(t), \qquad k = d,f, \qquad (33)$$

$$G_k(t) := \frac{1 - e^{-\xi_k t}}{\xi_k}, \qquad \lambda_k(t) = e^{\xi_k t}\kappa_k(t), \qquad k = d,f, \qquad (34)$$

for some mean reversion rates $\xi_d$ and $\xi_f$, where $\kappa_k(t)$ is the prospective
variability of the short rate. For the derivation of the PDE and further details of
the extended Vasicek model see [7]. For a standard European-style derivative
security we solve the PDE with the appropriate boundary conditions.

The most common type of cross-currency swap is the exchange of floating or
fixed rate interest payments on notional principals Zd and Zf in the domestic 
and foreign currencies respectively. We can also have a short rate or diff swap 
where payments are swapped over [0, T] on a domestic principal, with the float- 
ing rates based on the short rates in each country. A LIBOR currency swap is 
a swap of interest rate payments on two notional principals where the interest 
rates are based on the LIBOR for each country. The swap period [0, T] is divided 
into N periods and payments are denoted by pj. Now we describe precisely the 
deal that we are going to value which differs from that of [7], see also [10]. 



Wavelet Methods in PDE Valuation of Financial Derivatives 23 1 



Fixed-for-fixed cross-currency swap with a Bermudan option to cancel 

The cross-currency swap tenor is divided into $N_{cpn}$ coupon periods. The start
and end dates for these periods are given by $T_0, \ldots, T_{N_{cpn}}$ and cashflows are
exchanged at coupon period end dates $T_1, \ldots, T_{N_{cpn}}$. Typically, the swap
cashflows consist of coupon payments at annualized rates $R_d$ and $R_f$ on notional amounts
$Z_d$ for the first currency and $Z_f$ for the second currency. In addition, notional amounts
$Z_d$ and $Z_f$ may be exchanged at the swap start and/or end dates. The size of
a coupon payment is given by: coupon rate x notional amount x coupon period
day count. Both interest rates $R_f$ and $R_d$ are fixed at the outset of the contract,
as opposed to those for a LIBOR swap where they are floating [7]. There is no
path-dependence in the payoffs, i.e. the path taken is not relevant because the
payoff is fully determined by component values at the payment date. Payments
$P_j$ are made at the end of each period at time $t_j$ of size

$$P_j = \delta_j \left( S(t_j) R_f Z_f - m - Z_d R_d \right),$$

where $m$ is the margin to the issuing counterparty. This is the terminal condition
for the period $[t_{j-1}, t_j)$. The value of the deal is the sum of the present values
of all payments.

When the contract has a Bermudan option to cancel, one of the counterparties is
given an option to cancel all the future payments at times $t_1, \ldots, t_n$. Typically
$t_1, \ldots, t_n$ are set a fixed number of calendar days before the start date of each
period, i.e ti = — ki, . . . — A, where A is the notification 

period. We assume that net principal amounts $(Z_d - S(0)Z_f)$ are paid at time
0 and at time $t_N$ if the option is not cancelled, or at time $t_{k+1}$ if the option is
cancelled. The terminal condition at $t_N$ is given by

$$V(t_N) = \delta_N \left( S(t_N) Z_f (1 + R_f) - m - Z_d (R_d + 1) \right). \qquad (35)$$

When the option to cancel is exercised at tk, we exchange coupon payments due 
on Tjv^jj^_ 2 _n+fc and notional amounts. The holder of the option will terminate 
the deal if the expected future value of the deal is less than the termination cost. 
Thus the decision at time t^+i-A is to continue if 

Pd{tk+i — A, < S{tk+i — A)ZfPf{tk+i — A, — Pd{tk+i — A, t^j^.^)Zd. 
The deal is valued by dynamic programming backward recursion, solving the
PDE for the last period using the terminal condition (35) and stepping
backwards in time using the condition
V{tu+i - A) = min{Pd(tfc+i - t):+i)E(tfe+i), 

S{tk+i — A)ZfPf{th+\ — — Pd{tk+i — zi, 

for earlier periods. We then add on the exchange of principals at time 0. 

6 Numerical Results 

The numerical results using a 1D and 3D implementation of the wavelet method
of lines algorithm and the LSODE stiff vector ODE solver [16] are given below.
In each case the numerical deal values are compared with a standard PDE
solution technique and the known exact solution. Practical speed-up factors are
reported which increase with both boundary condition discontinuities and
spatial dimension.

European call option 

Stock price: 10 Strike price: 10 Interest rate: 5% Volatility: 20% Time 
to maturity: 1 Year 

The exact value of this option is: 1.04505. 

Comparing tables 1 and 2 shows a speedup of 1.9. 



Table 1. Wavelet Method of Lines Solution

Space Steps   Time Steps   Value     Solution Time in Seconds
64            60           1.03515   .05
128           100          1.04220   .10
256           200          1.04502   .13
512           200          1.04505   .30
1024          200          1.04505   .90

Table 2. Crank-Nicolson Finite Difference Method

Space Steps   Time Steps   Value     Solution Time in Seconds
64            60           1.03184   .02
128           100          1.04184   .04
256           200          1.04426   .09
512           200          1.04486   .16
1024          200          1.04501   .30
2000          200          1.04505   .57



Cash-or-nothing call 

The option with the same parameters as the European call.

The payoff is $B\,\mathcal{H}(S - K)$, where $B := 3$ is the cash given, with a single
discontinuity.

The exact value of this option is: 1.59297. 

Comparing tables 3 and 4 shows a speed up of 2.5. 



Table 3. Wavelet Method of Lines Solution

Space Steps   Time Steps   Value     Solution Time in Seconds
128           100          1.49683   .10
256           200          1.54904   .13
512           200          1.59216   .30
1024          400          1.59288   1.02

Table 4. Crank-Nicolson Finite Difference Scheme

Space Steps   Time Steps   Value     Solution Time in Seconds
128           200          1.46296   .04
256           400          1.53061   .10
512           400          1.56391   .18
1024          400          1.58046   .31
2048          800          1.58872   1.35
4096          800          1.59285   2.56



Supershare call 

Stock price: Strike price: 10 Parameter d: 3 Interest rate: 5% Volatility: 20%
Time to maturity: 1 Year

The option pays an amount $1/d$ if the stock price lies between $K$ and $K + d$, i.e.
the option has a payoff $\frac{1}{d}\left(\mathcal{H}(S - K) - \mathcal{H}(S - K - d)\right)$ with two discontinuities.
The exact value of this option is: 0.13855.

Comparing tables 5 and 6 shows a speed up of 4.9. 
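
The quoted exact value can be reproduced with a short calculation. The sketch below assumes the standard closed form for this payoff, namely the discounted difference of two digital-call probabilities with strikes $K$ and $K + d$, and assumes a stock price of 10 (the stock price is not legible in the parameter list above); the function names are hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function N(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def d2(S, K, r, sigma, tau):
    """The usual Black Scholes d_2 term for strike K."""
    return (math.log(S / K) + (r - 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))


def supershare(S, K, d, r, sigma, tau):
    """Value of a payoff (1/d) * (H(S - K) - H(S - K - d)) paid at expiry.

    Assumed closed form: discounted probability that K < S_T < K + d, over d.
    """
    discount = math.exp(-r * tau)
    prob = norm_cdf(d2(S, K, r, sigma, tau)) - norm_cdf(d2(S, K + d, r, sigma, tau))
    return discount * prob / d


# S = 10 is an assumption; the other inputs are the listed parameters.
value = supershare(S=10.0, K=10.0, d=3.0, r=0.05, sigma=0.20, tau=1.0)
```

Under these assumptions `value` matches the exact value 0.13855 stated above to the printed precision, which supports reading the missing stock price as 10.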



Table 5. Wavelet Method of Lines Solution

Space Steps   Time Steps   Value     Solution Time in Seconds
128           100          0.12796   .10
256           200          0.13310   .14
512           200          0.13808   .30
1024          400          0.13848   1.04

Table 6. Crank-Nicolson Finite Difference Scheme

Space Steps   Time Steps   Value     Solution Time in Seconds
128           200          0.12369   .04
256           400          0.13290   .09
512           400          0.13435   .16
1024          400          0.13666   .34
2048          800          0.13787   1.35
4096          800          0.13800   2.56
8000          800          0.13835   5.11



Cross Currency Swap 



Domestic fixed rate: 10%, Foreign fixed rate: 10% 

The exact value of this option is: 0.0 

Comparing tables 7 and 8 shows a speed up exceeding 81.



Table 7. Wavelet Method of Lines Solution

Discretization            Value      Solution Time in Seconds
20 x 8 x 8 x 8            -0.00082   1.2
20 x 16 x 16 x 16         -0.00052   6.54
20 x 32 x 32 x 32         -0.00047   40.40
40 x 64 x 64 x 64         -0.00034   410.10
100 x 128 x 128 x 128     -0.00028   4240.30
160 x 256 x 256 x 256     -0.00025   53348.10
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Table 8. Explicit Finite Difference Scheme 



Discretization 


Value 


Solution Time in Seconds 


20 X 8 X 8 a: 8 


-0.00109 


0.28 


20 X 16 X 16 a: 16 


-0.00101 


1.70 


20 X 32 X 32 a: 32 


-0.00074 


16.82 


40 X 64 X 64 a: 64 


-0.00058 


188.10 


100 a: 128 X 128 a: 128 


-0.00046 


2421.6 


160 a: 256 X 256 Al 256 


-0.00038 


33341.8 



7 Conclusions and Future Directions 

The wavelet method of lines performs well on problems with one spatial dimension
and discontinuities or spikes in the payoff. For example, in the supershare
option the wavelet method requires a lower discretization than the Crank-Nicolson
finite difference scheme for equivalent accuracy (3 decimal places), as the
discontinuities in the payoff can be resolved better in wavelet space. We also
see that for the (prototype) cross currency swap PDE in 3 spatial dimensions
the wavelet method outperforms the (tuned) explicit finite difference scheme by
approximately two orders of magnitude, a very promising result. One of the
important things to note is that O(N) wavelet-based PDE methods generalize
O(N log N) spectral methods without their drawbacks. This lower basic
complexity of the wavelet PDE method makes it suitable for solving higher
dimensional PDEs. Further, to improve the basic efficiency of the method we are
currently implementing an adaptive wavelet technique in which the wavelet
coefficients are thresholded at each time step (cf. §3). This should result in an
improvement in both speed and memory usage because of the sparse wavelet
representation. Such a technique has resulted in a further order of magnitude
speedup in other applications [18]. Future work will thus involve applying the
wavelet technique to solve cross currency swap problems with two and three factor
interest rate models for each currency, which leads to parabolic PDEs in 5 and 7
spatial dimensions respectively.
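
The effect of thresholding wavelet coefficients can be illustrated on a one-dimensional digital payoff. The sketch below is not the adaptive scheme referred to above: it assumes the simplest (Haar) wavelet, and the helper names `haar_forward`, `haar_inverse` and `threshold` are hypothetical. It shows that a payoff with a single jump is represented by very few non-zero coefficients, which is the sparsity the text relies on.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """Orthonormal Haar transform of a list whose length is a power of two."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        coeffs[:n] = averages + details  # coarse part first, then this level's details
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for avg, det in zip(out[:n], out[n:2 * n]):
            merged.append((avg + det) / SQRT2)
            merged.append((avg - det) / SQRT2)
        out[:2 * n] = merged
        n *= 2
    return out


def threshold(coeffs, eps):
    """Set coefficients with magnitude below eps to zero."""
    return [c if abs(c) >= eps else 0.0 for c in coeffs]


# Digital payoff sampled on 16 grid points: 0 below the strike, 1 above it.
payoff = [0.0] * 8 + [1.0] * 8
coeffs = haar_forward(payoff)
kept = threshold(coeffs, 1e-9)
nonzero = sum(1 for c in kept if c != 0.0)
restored = haar_inverse(kept)
print(nonzero, "of", len(coeffs), "coefficients are non-zero")
```

A jump that does not fall on a dyadic grid point needs more coefficients, but their number grows with the number of resolution levels rather than with the number of grid points.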
Abstract. In this paper we consider the corporate default problem. One of
the well-known approaches is to model the dynamics of the assets of the firm, 
and compute the probability that the assets fall below a threshold (which is 
related to the firm’s liabilities). When modeling the asset value dynamics as 
a jump-diffusion process (the most realistic model), a serious computational 
problem arises. In this paper we propose a fast method for computing the 
default probability. The new method achieves significant acceleration over the 
available approach. 

1 Introduction 

There are two main approaches for corporate default prediction. The first ap- 
proach uses financial statement data as inputs and uses a neural network or 
a regression model to predict the binary outcome default/no default (see for 
example [1]). The premise here is that the financial statement data such as rev- 
enues, earnings, debt, etc reflect the fundamental health of the firm. The other 
approach, invented by Merton [2] (see also [3]), is more of a “first principles 
approach”. It is based on modeling the value of the firm’s assets. The ability 
of the firm to fulfill its credit obligations is greatly affected by the relative value 
of the firm’s assets to its liabilities. By modeling the asset value as some pro- 
cess such as a Brownian motion, the probability of default can be obtained by 
computing the probability that the assets will fall below some threshold that is 
related to the firm’s liabilities. The advantage of such a model is that it gives the 
probability of default, rather than simply a default prediction. That can help 
in quantifying the risk for a credit portfolio. The drawback of such a model is
the computational load involved in computing the default probability through
lengthy Monte Carlo procedures. There are no closed-form solutions, except for
the simple Brownian motion asset dynamics model. The more realistic jump-diffusion
process needs a computationally intensive simulation. In this paper
we propose a fast Monte Carlo procedure to compute the default probability 
for such a model. The proposed procedure achieves several orders of magnitude 
speed-up over the conventional approach. 

2 Description of the Problem 

Merton [4] considered the jump-diffusion process, defined as 

$$dx = \mu\,dt + \sigma\,dw - a\,dq \qquad (1)$$

Figure 1: The evolution of the firm’s asset value, which follows a jump-diffusion
process



where x is the log of the asset value (the asset value is a log-normal process), and
$\mu$, $\sigma$ and $a$ represent respectively the drift term, the volatility of the
diffusion component, and the jump magnitude (which could be a constant or random).
The jump instances are typically governed by a Poisson process, say of rate
$\lambda$. The diffusion term represents the gradual day-to-day changes in the firm’s
prospects and value. On the other hand, the jump term represents sudden
events, for example the loss of a major contract, or even gradual events that
accumulate but are released by the management (to the investors and debtors)
in one shot.

Let the initial condition be $x(0) = x_0$. Assume that we would like to estimate
the probability of default in some interval [0, T] (T could represent the horizon
of the debt). We can assume that the default threshold is zero (we can always
translate x to achieve that).

3 Overview of the Method 

The main problem with evaluating the default probability is that no analytical
solution is known for the level crossing problem for a jump diffusion process.
Hence we have to resort to simulations to solve our problem. Generating the
jump-diffusion process using a plain Monte Carlo simulation is very slow and
impractical, especially in view of the fact that we have to perform the Monte
Carlo simulation for every debt instrument in the portfolio, and this whole
procedure has to be repeated many times if the parameters of the model have to be
tuned. A known rule of thumb is to use a time sampling rate of 6 points a day,
that is 6 × 250 = 1500 points a year. So for example if the horizon of the debt
is 5 years, we have to evaluate 7500 points for every Monte Carlo iteration.
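
For reference, the plain path simulation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the paper: the jump is applied with probability λ·dt per step (at most one jump per step), the jump size a is constant, and the function name and argument order are hypothetical.

```python
import math
import random


def plain_mc_default_prob(x0, mu, sigma, lam, a, horizon, dt, runs, seed=0):
    """Fraction of simulated paths of dx = mu dt + sigma dw - a dq that reach zero."""
    rng = random.Random(seed)
    steps = int(round(horizon / dt))
    defaults = 0
    for _ in range(runs):
        x = x0
        for _ in range(steps):
            # Euler step of the diffusion part.
            x += mu * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            # Bernoulli approximation of the Poisson jump arrivals.
            if rng.random() < lam * dt:
                x -= a
            if x <= 0.0:
                defaults += 1
                break
    return defaults / runs


# Rule of thumb from the text: 6 points a day, 250 days a year, 5-year horizon.
points_per_path = 6 * 250 * 5
estimate = plain_mc_default_prob(1.0, 0.05, 0.3, 0.3, 0.3, 5.0, 0.001, runs=100)
print(points_per_path, estimate)
```

Every run walks through thousands of time points, which is the cost the proposed method avoids.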
The method we propose is also a Monte Carlo-type method. The distinction here
is that we typically have to evaluate 6 or 7 points every Monte Carlo iteration, 
thus it is several orders of magnitude faster than the plain Monte Carlo approach. 
The method is also very flexible. It can handle the following cases: 

1) the case when the shock magnitudes obey a certain density rather than
being a constant a.

2) the case when the shocks are correlated across debt instruments.

3) the case when the shock arrivals do not follow a standard Poisson process,
for example when they follow a Poisson process with a time-varying rate of
arrival, or when they are serially correlated. For example, one might consider
the case where shocks occur in bursts.

The basic idea of the method is that first we generate the shock instances from 
the Poisson distribution. In between any two shocks the state variable follows a 
pure diffusion process. But, we know that for a diffusion process the level crossing 
problem has a closed form solution. The level crossing problem is often referred 
to as the first passage problem (see [5]). A closed form solution exists only for 
a diffusion process [6]. We utilize this solution and combine it with whether the 
shocks have caused default to obtain an estimate for the default probability for 
this particular run. We repeat this experiment in a Monte Carlo fashion, and 
compute the average to obtain our estimate of the default probability. Since 
typically there will be about 6 or 7 shocks for the horizon considered, we need 
to perform only this amount of computations per Monte Carlo iteration. The 
details of the method are described in the next section. 

4 Details of the Proposed Method 

Consider Figure 2 for the definitions of the variables. As we mentioned, we
generate the shock times $T_1, T_2, \ldots$ from a Poisson distribution. Then we
generate the state variables at the times $T_1, T_2, \ldots$ immediately before the
shock (say $x(T_1^-), x(T_2^-), \ldots$). These are generated from a Gaussian
distribution with appropriate mean and standard deviation. Consider for example
that we would like to generate $x(T_i^-)$. Since we know the value of
$x(T_{i-1}^+)$ (the value of the state at $T_{i-1}$ but immediately after the
shock), and we know that the process from $x(T_{i-1}^+)$ to $x(T_i^-)$ is a
Brownian motion, we generate $x(T_i^-)$ from a Gaussian density of mean
$x(T_{i-1}^+) + c\,(T_i - T_{i-1})$ and standard deviation $\sigma\sqrt{T_i - T_{i-1}}$.

Once we generate the data, we check whether any of the $x(T_i^-)$ or $x(T_i^+)$ is
below the default threshold zero. If yes, we declare that for this particular
Monte Carlo run default was certain (i.e. we consider $P_{def} = 1$). If default
has not occurred, then we have to check each of the intervals in between the
shocks $(T_{i-1}, T_i)$ and estimate the probability of default $P_{def}(i-1, i)$
in these (or to be precise the probability that x(t) goes below the default
threshold in these intervals).

Let K be 1 + the number of shocks. To simplify the notation, let $T_K = T$, i.e.
all $T_i$'s are the shock times except the last one, which equals T. Once all the
intervals are considered, the probability of default in the whole interval [0, T]
can be obtained as

$$P_{def} = 1 - \prod_{i=1}^{K} \left[1 - P_{def}(i-1, i)\right] \qquad (8)$$

It can be proved that the inter-jump default probabilities are given by:

$$P_{def}(i-1, i) = \begin{cases} \exp\left(-\dfrac{2\,x(T_{i-1}^+)\,x(T_i^-)}{\sigma^2\,(T_i - T_{i-1})}\right) & \text{if } x(T_i^-) > 0 \\ 1 & \text{if } x(T_i^-) < 0 \end{cases}$$

The following is a summary of the algorithm:

1) Generate $T_i$ according to the Poisson distribution, by generating inter-jump
times $T_i - T_{i-1}$ from an exponential density.

2) For j = 1 to M perform the following Monte Carlo runs (Steps 3-7):

3) Starting from i = 1 to K (K is 1 + the number of jumps that happen to fall
in [0, T]), perform the following: a) Generate $x(T_i^-)$ from a Gaussian
distribution of mean $x(T_{i-1}^+) + c\,(T_i - T_{i-1})$ and standard deviation
$\sigma\sqrt{T_i - T_{i-1}}$. (Let $x(T_0^+)$ be the starting state x(0)).
b) Compute $x(T_i^+)$ as $x(T_i^-) - a$.

4) If any of $x(T_i^-), x(T_i^+)$ are below zero (the default threshold), then set
$P_{def} = 1$. Go to 3) for another cycle.

5) If $x(T_i^-), x(T_i^+)$ are all positive, then perform Steps 6 and 7.

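6) For i = 1 to K - 1 compute

$$P_{def}(i-1, i) = \exp\left(-\frac{2\,x(T_{i-1}^+)\,x(T_i^-)}{\sigma^2\,(T_i - T_{i-1})}\right)$$

7) Compute $P_{def}$ for this run from Eq. (8).

8) If j = M, i.e. we have completed all cycles of the Monte Carlo simulation,
obtain the estimate of the default probability:

$$P(\text{default in } [0,T]) = \frac{1}{M} \sum_{j=1}^{M} P_{def}^{(j)}$$

where $P_{def}^{(j)}$ is the value of $P_{def}$ obtained in run j.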
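
The steps above can be sketched in a few lines. This is an illustrative reading of the algorithm, not the authors' implementation, and it rests on several assumptions: the drift c is passed in directly (the example call sets it equal to μ), the jump size a is constant, fresh jump times are drawn in every run, no jump is applied at the horizon T itself, and all K intervals including the last one enter the product of Eq. (8). The function names are hypothetical.

```python
import math
import random


def bridge_cross_prob(x_start, x_end, sigma, dt):
    """Probability that a Brownian path pinned at both ends reaches zero in between."""
    if x_start <= 0.0 or x_end <= 0.0:
        return 1.0
    return math.exp(-2.0 * x_start * x_end / (sigma * sigma * dt))


def one_run(x0, c, sigma, lam, a, horizon, rng):
    """P_def of a single Monte Carlo run (Steps 3 to 7)."""
    # Jump times from exponential inter-jump times; the horizon closes the list.
    times = []
    t = rng.expovariate(lam)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(lam)
    times.append(horizon)

    survive = 1.0
    x_after, t_prev = x0, 0.0
    last = len(times) - 1
    for i, t_i in enumerate(times):
        dt = t_i - t_prev
        # State just before T_i, drawn from the Gaussian transition density.
        x_before = rng.gauss(x_after + c * dt, sigma * math.sqrt(dt))
        # State just after T_i; the final time T is not a jump time.
        x_next = x_before - a if i < last else x_before
        if x_before <= 0.0 or x_next <= 0.0:
            return 1.0
        survive *= 1.0 - bridge_cross_prob(x_after, x_before, sigma, dt)
        x_after, t_prev = x_next, t_i
    return 1.0 - survive


def fast_default_prob(x0, c, sigma, lam, a, horizon, runs, seed=0):
    """Average of the per-run default probabilities (Step 8)."""
    rng = random.Random(seed)
    return sum(one_run(x0, c, sigma, lam, a, horizon, rng) for _ in range(runs)) / runs


estimate = fast_default_prob(1.0, 0.05, 0.3, 0.3, 0.3, 5.0, runs=2000, seed=1)
print(estimate)
```

Each run touches only the jump times plus the horizon, so the work per run is proportional to the number of shocks rather than to the number of time steps.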



5 Simulations 

We have tested the proposed algorithm using artificial data. We considered a
case with $\mu = 0.05$, $\sigma = 0.3$, $\lambda = 0.3$, and $a = 0.3$. The
prediction horizon is T = 5, and the initial value x(0) = 1. All time units are
considered in years. In addition to our method, we simulated the plain Monte
Carlo approach (i.e. simulating the actual path of the jump-diffusion process in
a Monte Carlo fashion); we used a time step of 0.001 (very close to the rule of
thumb).

For both the proposed method and the standard Monte Carlo method, we ran
1000 runs. The actual default probability was 0.2565. The standard Monte
Carlo obtained an RMS error of 0.014 in a CPU time of 200.9. On the other hand,
the proposed method obtained an RMS error of 0.011 in a CPU time of 3.4. One can
see the superiority of the proposed method, especially in computational speed.
Abstract. The tasks of optimizing asset allocation considering transaction
costs can be formulated into the framework of Markov Decision Processes (MDPs)
and reinforcement learning. In this paper, a risk-averse reinforcement
learning algorithm is proposed which improves the asset allocation
strategy of portfolio management systems. The proposed algorithm
alternates policy evaluation phases which take into account the mean and 
variance of return under a given policy and policy improvement phases 
which follow the variance-penalized criterion. The algorithm is tested on 
trading systems for a single future corresponding to a Japanese stock 
index. 



1 Introduction 

Asset allocation and portfolio management deal with the task of constructing
the optimal distribution of capital over various investment opportunities
including stocks, foreign exchange and others. If the transaction cost of
changing allocations must be considered, the task of maintaining a profitable
portfolio for a given risk level in a dynamic market can be formulated as a
Markov Decision Problem [5].

Reinforcement learning [9] finds reasonable solutions (policies) for large state
space MDPs by using various function approximators such as neural networks,
linear regression and so on. Typical reinforcement learning algorithms (e.g.
Q-Learning [10], ARL [2]), however, do not take the risk of a policy into
account. Although several techniques have been developed for reflecting a given
risk level [4] [7], they do not estimate the variance of return.

The authors derived a temporal difference (TD) algorithm for estimating
the variance of return under a policy in MDPs [3]. In this paper, we simplify
the TD algorithm by making use of characteristics of the MDPs formulated in
[6]. This simplification allows decision making to rely on only two value
functions no matter how many assets must be considered. We then propose a policy
iteration algorithm using the TD algorithm and the variance-penalized policy
criterion [11], and test the algorithm using neural networks on the real world
task of trading futures corresponding to a Japanese stock index.

2 An MDP Formulation for Portfolio Management

Markov Decision Processes (MDPs) [8] provide sequential decision making models
in stochastic environments. MDPs can be described by a state space
$\mathcal{X}$, an action space $\mathcal{U}(x)$ of admissible control actions for
every state $x \in \mathcal{X}$, a transition function $\mathcal{P}$, an immediate
reward function $\mathcal{R}$ and a policy $\mu$ of a decision maker (agent). The
agent interacts with the environment by observing state $x_t \in \mathcal{X}$,
selecting action $u_t \in \mathcal{U}(x_t)$, being evolved to next state
$x_{t+1}$, and receiving immediate reward $r_{t+1} \in \mathbb{R}$.

[Figure legend: x: state (M: market, A: portfolio); c: transition cost; g: gain
of wealth; $\mathcal{P}_M$: stochastic transition; $\mathcal{P}_A$: deterministic
transition]

Fig. 1. An MDP formulation for portfolio management


Neuneier [6] formulated asset allocation as a Markov decision problem under the
following assumptions: the agent is not able to influence the market by its
trading actions and the agent has an infinite time horizon (Figure 1). In this
formulation, state $x_t = (M_t, A_t)$ consists of element $M_t$, which
characterizes the market, and element $A_t$, which characterizes the allocation of
the capital at time t. Action space $\mathcal{U}(x)$ corresponds to the set of all
admissible transactions on $A_t$. Immediate reward $r_{t+1}$ consists of
transaction cost $c_{t+1}$ and $g_{t+1}$ characterizing the gain of wealth from
time t to t + 1.

The agent aims to maximize some utility function of the discounted sum of
rewards over an infinite horizon (i.e. return
$R_t = \sum_{k=0}^{\infty} \gamma^k (c_{t+k+1} + g_{t+k+1})$). Q-Learning,
which is a typical reinforcement learning algorithm, aims to maximize the
expected return. Therefore an essential problem is how to estimate the expected
return for each state x and action u under policy $\mu$. Typically, the Q-value
(i.e. $Q^\mu(x,u) = E\{R_t \mid x_t = x, u_t = u, \mu\}$) and the state-value
(i.e. $V^\mu(x) = E\{R_t \mid x_t = x, \mu\}$) are estimated.

By introducing the intermediate state $(M_t, A'_t)$, the state transition is
resolved into a deterministic transition $\mathcal{P}_A$ and an uncontrollable
Markov chain $\mathcal{P}_M$. Since the agent knows the deterministic transition
function and the cost function in asset allocation tasks, only one optimal value
function $Q^*(M, A')$ is needed for optimal decision making. The agent can obtain
the optimal policy $\mu^*$ by the following optimality equation:

$$\mu^*(M, A) = \arg\max_u \left[c(A, \mathcal{P}_A(A, u)) + Q^*(M, A')\right].$$

In this formulation, Q-Learning estimates Q-values for each M and A' using the
following equation and stochastic approximation algorithms:

$$Q^\mu(M_t, A'_t) = E\left[g_{t+1} + \gamma\left[c(A_{t+1}, A'_{t+1}) + Q^\mu(M_{t+1}, A'_{t+1})\right]\right],$$

where $g_{t+1} = g(A'_t, M_t, M_{t+1})$ and $A' = \mathcal{P}_A(A, \mu(M, A))$. If
learning algorithms demand an improved policy, it can be obtained by the
following operation:

$$\mu_{new}(M_t, A_t) = \arg\max_u \left[c(A_t, u) + Q^\mu(M_t, A'_t)\right].$$

This learning algorithm (QLU algorithm [6]) converges to the optimal Q-values
under the standard conditions and can be extended to large state space asset
allocation by using function approximators. Besides, since this formulation
needs only one value function, it can be extended to tasks in which many
assets must be considered. But QLU has the disadvantage that it does not take
the risks of the decisions into account.

3 Variance-penalized Reinforcement Learning 

Within the MDP formulation for portfolio management in Figure 1, a central
problem is how to estimate the value functions of the intermediate state for a
given policy $\mu$. Since an agent can predict the transaction cost and the
resultant asset allocation for each admissible transaction, if the agent can
evaluate the values of the intermediate state, the most profitable transaction
can be determined from the evaluated values and costs.

We introduce into the framework the variance-penalized criterion, which is one
of the risk-averse criteria. Values of the intermediate states are evaluated by
the following:

$$E\{R \mid M, A', \mu\} - \alpha\,\mathrm{Var}\{R \mid M, A', \mu\},$$

where R is the return and $\alpha$ is a non-negative tradeoff parameter between
risk and return. Here, we define
$q^\mu(M, A') = \mathrm{Var}\{R_t \mid M_t = M, A'_t = A', \mu\}$ and call it
the q-value of state (M, A') for policy $\mu$. Now, the central problem is the
estimation of the q-value function.

For any MDP, the variance of return R, for any state x and action u, satisfies
the following Bellman equation:

$$q^\mu(x_t, u_t) = E\left[\left(r_{t+1} + \gamma V^\mu(x_{t+1}) - Q^\mu(x_t, u_t)\right)^2 + \gamma^2 v^\mu(x_{t+1})\right],$$

where $q^\mu(x,u) = \mathrm{Var}\{R_t \mid x_t = x, u_t = u, \mu\}$ and
$v^\mu(x) = \mathrm{Var}\{R_t \mid x_t = x, \mu\}$. By stochastic approximation
algorithms using the above equation (TD algorithm), the estimated q(x,u)
converges to $q^\mu(x,u)$ for any x and u [3].
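
The variance recursion can be checked numerically on a small example. The sketch below is not the TD algorithm of [3]: it assumes a fixed policy, works with state values only, and replaces sampling by exact expectations iterated to a fixed point. The transition table format and the function name are hypothetical.

```python
def evaluate_mean_and_variance(transitions, gamma, sweeps=2000):
    """Mean V and variance v of the discounted return under a fixed policy.

    transitions[x] is a list of (probability, reward, next_state) triples.
    """
    n = len(transitions)
    # Mean of the return: V(x) = E[r + gamma * V(x')].
    V = [0.0] * n
    for _ in range(sweeps):
        V = [sum(p * (r + gamma * V[y]) for p, r, y in transitions[x])
             for x in range(n)]
    # Variance of the return: v(x) = E[(r + gamma V(x') - V(x))^2 + gamma^2 v(x')].
    v = [0.0] * n
    for _ in range(sweeps):
        v = [sum(p * ((r + gamma * V[y] - V[x]) ** 2 + gamma ** 2 * v[y])
                 for p, r, y in transitions[x])
             for x in range(n)]
    return V, v


# One state with a fair coin-flip reward of +1 or -1 at every step.
coin = [[(0.5, 1.0, 0), (0.5, -1.0, 0)]]
V, v = evaluate_mean_and_variance(coin, gamma=0.9)
print(V[0], v[0])
```

For this chain the rewards are independent with unit variance, so the variance of the return is the geometric sum of $\gamma^{2k}$, that is $1/(1-\gamma^2)$, which the iteration reproduces.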

For the simplified MDPs in Fig. 1, the q-values and Q-values satisfy the
following equation:

$$q^\mu(M_t, A'_t) = E\left[\left(g_{t+1} + \gamma\left(c(A_{t+1}, A'_{t+1}) + Q^\mu(M_{t+1}, A'_{t+1})\right) - Q^\mu(M_t, A'_t)\right)^2 + \gamma^2 q^\mu(M_{t+1}, A'_{t+1})\right].$$

$q^\mu(M, A')$ can also be estimated by stochastic approximation. If the Q-value
and q-value functions are obtained for policy $\mu$, the new policy is obtained
as follows:

$$\mu_{new}(M_t, A_t) = \arg\max_u \left[c(A_t, u) + Q^\mu(M_t, A'_t) - \alpha\,q^\mu(M_t, A'_t)\right]. \qquad (1)$$
1. collect samples of $M_t$ and $M_{t+1}$ from the training data set, a random
asset allocation $A'_t$ and $c(A_{t+1}, A'_{t+1})$.

2. train the Q-value function using the following targets for each sampled t:

$Q^\mu(M_t, A'_t) \leftarrow g_{t+1} + \gamma\left(c(A_{t+1}, A'_{t+1}) + Q^\mu(M_{t+1}, A'_{t+1})\right)$

3. train the q-value function using the following targets for each sampled t:

$q^\mu(M_t, A'_t) \leftarrow \left(g_{t+1} + \gamma\left(c_{t+1} + Q^\mu(M_{t+1}, A'_{t+1})\right) - Q^\mu(M_t, A'_t)\right)^2 + \gamma^2 q^\mu(M_{t+1}, A'_{t+1})$

4. generate policy $\mu_{new}$ by equation (1) and evaluate it by some measure.

5. if $\mu_{new}$ is superior to $\mu$, update $\mu$, the Q-value and q-value
functions and go to 1, otherwise break.

Fig. 2. Variance-penalized Policy Iteration for Portfolio Management



Unfortunately, under the variance-penalized criterion, there does not exist a
simple optimality equation such as the Bellman optimality equation. This means
that $\mu_{new}$ is not guaranteed to be superior to $\mu$. Thus, policy
$\mu_{new}$ must be evaluated by some performance measure before it replaces
$\mu$. In asset allocation, the performance of trading results on training data
or validation data can be such a measure. Note that incremental learning
algorithms are not suitable for this criterion because they generate candidates
for a new policy at each time step.
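
The selection rule of equation (1) and the acceptance check described above can be sketched as follows. This is an illustration only: the candidate table, the numbers in it and both function names are hypothetical, and transaction costs are entered as negative rewards, which is an assumption about the sign convention.

```python
def greedy_transaction(candidates, alpha):
    """Pick the action maximizing cost + Q - alpha * q.

    candidates maps each admissible action to a tuple (cost, Q, q) evaluated at
    the intermediate state that the action leads to.
    """
    def penalized_value(action):
        cost, big_q, small_q = candidates[action]
        return cost + big_q - alpha * small_q

    return max(candidates, key=penalized_value)


def maybe_replace(current, candidate, score):
    """Keep the candidate policy only if the chosen performance measure prefers it."""
    return candidate if score(candidate) > score(current) else current


# Hypothetical values for three positions: (cost, expected return, variance of return).
table = {
    "long": (-0.002, 0.10, 0.30),
    "neutral": (0.0, 0.02, 0.0),
    "short": (-0.002, -0.05, 0.30),
}
risk_neutral_choice = greedy_transaction(table, alpha=0.0)
risk_averse_choice = greedy_transaction(table, alpha=0.4)
print(risk_neutral_choice, risk_averse_choice)
```

With α = 0 the rule reduces to the usual greedy choice on the expected return; a larger α shifts the choice toward the low-variance position.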

Figure 2 shows the QqPI learning algorithm. Since QqPI requires only
two value tables or neural networks as function approximators, independent of
the number of assets to be considered, it can be extended to large-scale tasks.
Because the training of q-values requires the estimated Q-values, the training
of the q-value function should be carried out after the training of the Q-value
function.



4 Experiments 

The framework of learning described above is also applicable to trading systems
dealing with a single security. In this section we test our algorithm on the
task of trading a future which behaves like a Japanese stock index.

At each time t, our agent can take one of long, neutral and short positions
(i.e. $pos_t \in \{-1: \text{long},\ 0: \text{neutral},\ 1: \text{short}\}$). The
agent observes the N-nearest histories of prices and trading volumes of the
security (i.e. $H_t = \{pri_t, \ldots, pri_{t-N+1}, vol_t, \ldots, vol_{t-N+1}\}$).

The reward is calculated by the following: 

$r_{t+1} = g_{t+1} + c_{t+1}$,

^po.st(P^^ - l)+TC{\post-post-i\){!.0+post-i{^^^^ - 1)), 

prif prif 
Fig. 3. The Neural Network Architecture

where TC (actually 0.2%) denotes the transaction cost and $r_{t+1}$ means the rate
of increase of wealth (W) from time t to t + 1 (i.e. $r_{t+1} = W_{t+1}/W_t - 1$).
If the agent invests a fixed fraction of accumulated wealth in each long or short
trade, return $R_t$ corresponds to

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \left(\frac{W_{t+1}}{W_t} - 1\right) + \gamma\left(\frac{W_{t+2}}{W_{t+1}} - 1\right) + \gamma^2\left(\frac{W_{t+3}}{W_{t+2}} - 1\right) + \cdots. \qquad (2)$$

Our reinforcement learning agent aims to maximize Eq. (2).
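
As a small worked example of Eq. (2), the return can be computed from a finite wealth series by truncating the infinite sum at the end of the series. The function name and the sample numbers are hypothetical.

```python
def discounted_return(wealth, gamma):
    """Truncated version of Eq. (2) for t = 0, given wealth values W_0, W_1, ..."""
    total = 0.0
    for k in range(len(wealth) - 1):
        rate = wealth[k + 1] / wealth[k] - 1.0  # r_{k+1} = W_{k+1} / W_k - 1
        total += gamma ** k * rate
    return total


# Wealth grows by 10% in each of two periods; with gamma = 0.5 the return is 0.1 + 0.5 * 0.1.
example_return = discounted_return([100.0, 110.0, 121.0], gamma=0.5)
print(example_return)
```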

Fig. 3 shows our network architecture. A pre-trained network for predicting
trend [1] yields market status $M_t$ from $H_t$. $pos_t$ is transformed into the
portfolio part $A'_t$ by standard binary transformation. $M_t$ and $A'_t$ compose
the state space.

We used two neural networks as function approximators for the Q-value and
q-value functions. The trend-predict network had 40-40-2 units and the output
units are sigmoidal. The value-approximate networks had 5-10-1 units and the
output units are linear. These networks are trained by back-propagation (2-fold
cross-validation). Q-value and q-value are sampled from short trajectories which
begin with random time and position.

We applied QqPI to the Japanese stock index data (Jan. 1992-Sep. 2000) with
various values of the tradeoff parameter $\alpha$. We divided the data into a
training set (Jan. 1992-Dec. 1996), a validation set (Jan. 1997-Dec. 1998) and a
test set (Jan. 1999-Sep. 2000). The training set is used for tuning the weights of
the neural networks. The policy evaluation measure is based on the simulation
results on the training set and validation set. The best policies were applied to
the test set. Table 1 shows that the proposed algorithm with an adequate tradeoff
parameter learned stable policies.

5 Conclusion and Future Work

A risk-averse reinforcement learning method has been proposed for improving
the asset allocation policy of portfolio management systems. The algorithm was
tested on real data and showed valid results. But more experiments are
needed with various feature extraction methods. Fundamental trading and
multi-asset tasks are other future directions.



Table 1. Comparison of the capital gains

Tradeoff param.   Training set   Validation set   Test set
α = 0.0           29.2%          0.8%             26.2%
α = 0.2           61.8%          12.4%            18.9%
α = 0.4           59.4%          13.3%            9.4%



References 

[1] Baba, N.: A user friendly decision support system for dealing stocks using neural
network. Proc. of IJCNN, Vol.1, (1993) 762-765.

[2] Mahadevan, S.: Average reward reinforcement learning: foundations, algorithms,
and empirical results. Machine Learning, vol.22, (1996) 159-196.

[3] Makoto, S., Hajime, K., Shigenobu, K.: TD algorithm for the variance of return and
mean-variance reinforcement learning. Journal of Japanese Society for Artificial
Intelligence, (in Japanese, to appear).

[4] Moody, J., Saffell, M.: Reinforcement learning for trading systems and portfolios.
Proc. of KDD-98, (1998) 279-283.

[5] Neuneier, R.: Optimal asset allocation using adaptive dynamic programming. Proc.
of NIPS 9, MIT Press (1996).

[6] Neuneier, R.: Enhancing Q-Learning for optimal asset allocation. Proc. of NIPS
10, MIT Press (1997) 936-942.

[7] Neuneier, R., Mihatsch, O.: Risk sensitive reinforcement learning. Proc. of NIPS
11, MIT Press (1998) 1031-1037.

[8] Puterman, M.L.: Markov Decision Processes. John Wiley & Sons, Inc., New York,
(1994).

[9] Sutton, R. S., Barto, A. G.: Reinforcement Learning: An Introduction. MIT Press,
(1998).

[10] Watkins, C.: Learning from delayed rewards. PhD thesis, King’s College,
UK (1989).

[11] White, D. J.: Mean, variance, and probabilistic criteria in finite Markov decision
processes: A review. Journal of Optimization Theory and Applications, vol.56(1),
(1988) 1-29.




Applying Mutual Information to Adaptive
Mixture Models



Zheng Rong Yang† and Mark Zwolinski‡

†Department of Computer Science, Exeter University, Exeter EX4 4PT, UK
Z.R.Yang@ex.ac.uk

‡Department of Computer Science and Electronics, Southampton University,
Southampton SO17 1BJ, UK



Abstract. This paper presents a method for determining an optimal set
of components for a density mixture model using mutual information.
A component with small mutual information is believed to be independent
from the remaining components and to make a significant contribution
to the system, and hence cannot be removed. A component with
large mutual information, by contrast, is unlikely to be independent from the
remaining components within a system and hence can be removed. Repeatedly
removing components with positive mutual information until the system
mutual information becomes non-positive finally gives rise to a
parsimonious structure for a density mixture model. The method has been
verified with several examples.



1 Introduction 

Many pattern recognition systems need to discover the underlying probability
density function for the purpose of efficient decision making. The density
mixture model is a powerful tool for pattern recognition. The basic
computational element of a density mixture model is a component that has a
nonlinear mapping function. Suppose that a set of components is
$\Theta = \{\theta_k, k = 1, \ldots, m\}$ and the component function of component
$\theta_k$ is $K(\theta_k)$; a density mixture model is then a mixture of
component functions on $\Theta$, $p = \sum_k w_k K(\theta_k)$, where $w_k$ is the
mixing coefficient of component $\theta_k$ satisfying $w_k \in \mathbb{R}$ and
$\sum_k w_k = 1$. One of the popular component functions is the Gaussian
$K(\theta_k) = N(\mu_k, \sigma_k^2)$ [2], [5], [7], [10], [12], [13], [14], [16],
as in the Parzen estimator [8], where $\mu_k \in \mathbb{R}^d$ and $\sigma_k^2$ are
the center and the variance of the component $\theta_k$ respectively and d is the
dimension. Since these non-parametric methods fix all the observation patterns as
the centers of components, they are inefficient in time and space. To overcome
this, parsimonious structures are desirable [12], [13], [14], [16].

Given a set of patterns $\Omega$, instead of exhaustively searching all the
possible subsets of $\Omega$, the Kullback-Leibler distance,
$J = \int p(x)\ln(p_r(x)/p(x))\,dx$, has been employed to select the best
representation (an optimal subset) of $\Omega$ ($\Omega^* \subset \Omega$) [5].
p(x) is the Parzen estimator, which fixes all patterns from $\Omega$ as the
centers of components, and $p_r(x)$ is a reduced Parzen estimator, which employs a
subset of $\Omega$. This method assumes that there is a best representation of
$\Omega$. If $\Omega$ does not contain a best representation, the method will not
be able to give an accurate density estimator. Moreover, this method fixes the
number of components for a density mixture model. In the general regression
neural network, the centers of components were sought by employing a forgetting
function dynamically [12], [13]. A covariance matrix was also used for the
probabilistic neural network [14]. These methods avoided fixing original patterns
as the centers of components, but the learning mode is homoscedastic, which is
not robust for most real applications. Since heteroscedastic training of the
Parzen estimator using the Expectation-Maximization (EM) algorithm frequently
falls into a local minimum [15], a robust statistical method called the Jackknife
was used to build a robust estimator [16]. But determining an optimal set of
components for a density mixture model remains a trial-and-error step in that
procedure [16].

All the above work selected components using the information contained 
in the patterns rather than using the information contained in the compo- 
nents themselves. It is therefore not easy to make these techniques adaptive. 
A decision-to-add-one rule was therefore developed to select an optimal set of 
components using the information from both patterns and components [9]. The 
decision to create a new component was determined by whether a new pattern 
contains information, which is not found in the existing components. If a data 
space is not well ordered, the computational cost of reconstruction may be large. 
Moreover, this method is very sensitive to noise. 

This paper presents a new method, which determines an optimal set of
components by investigating the relationship between components. If two
components are mutually dependent their relationship will be strong. Hence mutual
information [11] can be used to measure whether one component is strongly
dependent on the remaining components. If a component has small mutual
information with respect to the remaining components in a system, the component
is assumed to be independent from them. Thus the component makes a significant
contribution to the system and hence cannot be removed. If the mutual
information of a component is large, it is not independent from the remaining
components in a system and removing it will not significantly change the system
probability density function. By repeatedly removing the component with the
largest positive mutual information until the system mutual information becomes
non-positive, an optimal set of components can be found. The probability density
function constructed by a density mixture model using this optimal set of
components will be an approximation to the true probability density function.

2 Density mixture model 

Both the greatest gradient and the maximum likelihood methods can be used to
estimate the parameters of a density mixture model. It has been shown that
the maximum likelihood method is a fast training procedure for some cases
[15]. Suppose that there are n patterns and m components, each of which
has a Gaussian probability density function. Let $w_k$, $\mu_k$ and
$\beta_k = 1/\sigma_k^2$ be the mixing coefficient, the center and the smoothing
parameter for component $\theta_k$ respectively; a density mixture model is
$p(x) = \sum_k w_k\,p(x \mid \theta_k)$, where
$p(x \mid \theta_k) = (\beta_k/\pi)^{d/2}\exp(-\beta_k \|x - \mu_k\|^2)$. A
logarithmic likelihood function with an optimization term (the constraint on the
mixing coefficients, with multiplier $\eta$) is
$L = \ln\prod_{i=1}^{n} p(x_i) - \eta\,(\sum_k w_k - 1)$. Setting the
derivatives of the log likelihood function with respect to each parameter to zero
leads to estimates of the centers, the variances and the mixing coefficients of
the components: $\mu_k = \sum_i \bar{\lambda}_{i,k}\,x_i$,
$\sigma_k^2 = \sum_i \bar{\lambda}_{i,k}\,\|x_i - \mu_k\|^2 / d$ and
$w_k = \frac{1}{n}\sum_i \lambda_{i,k}$ respectively, where
$\lambda_{i,k} = w_k\,p(x_i \mid \theta_k)/p(x_i)$ and
$\bar{\lambda}_{i,k} = \lambda_{i,k}/\sum_i \lambda_{i,k}$.
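
The re-estimation formulas can be illustrated with one update step for a mixture of isotropic Gaussians. The sketch below is a textbook EM step, not the authors' code, and it uses the standard parametrization $N(\mu_k, \sigma_k^2 I)$ rather than the smoothing parameter $\beta_k$; the function name and the toy data are hypothetical.

```python
import math


def em_step(points, weights, centers, variances):
    """One EM update for a mixture of isotropic Gaussians N(mu_k, sigma_k^2 I)."""
    n, d, m = len(points), len(points[0]), len(weights)
    # E-step: resp[i][k] = w_k p(x_i | theta_k) / p(x_i).
    resp = []
    for x in points:
        dens = []
        for k in range(m):
            sq = sum((xj - cj) ** 2 for xj, cj in zip(x, centers[k]))
            norm = (2.0 * math.pi * variances[k]) ** (-d / 2.0)
            dens.append(weights[k] * norm * math.exp(-sq / (2.0 * variances[k])))
        total = sum(dens)
        resp.append([value / total for value in dens])
    # M-step: weighted means, weighted isotropic variances, average responsibilities.
    new_w, new_c, new_v = [], [], []
    for k in range(m):
        mass = sum(resp[i][k] for i in range(n))
        center = [sum(resp[i][k] * points[i][j] for i in range(n)) / mass
                  for j in range(d)]
        spread = sum(resp[i][k] * sum((points[i][j] - center[j]) ** 2 for j in range(d))
                     for i in range(n))
        new_w.append(mass / n)
        new_c.append(center)
        new_v.append(spread / (mass * d))
    return new_w, new_c, new_v


# Two well-separated one-dimensional clusters around -1 and +1.
points = [[-1.1], [-1.0], [-0.9], [0.9], [1.0], [1.1]]
w, c, v = [0.5, 0.5], [[-0.5], [0.5]], [1.0, 1.0]
for _ in range(50):
    w, c, v = em_step(points, w, c, v)
print(w, c, v)
```

Here the number of components is fixed in advance; choosing that number is the problem the mutual information criterion of this paper addresses.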
3 Mutual information theory

Mutual information theory is used to measure the information shared among
objects and aims to minimize the entropy within a system [11]. It has been
applied in a number of areas [1], [3], [4], [6]. If the information shared
between two objects is small, the two objects are likely to be independent;
otherwise they are likely to be dependent on each other. The necessity of a
component is therefore determined by the mutual information between that
component and the remaining components in the system. If the mutual
information of a component is large, the component is unlikely to make a
significant contribution to the system probability density function, because
it is not independent of the remaining components, and it can be removed from
the system. The mutual information is the difference between the initial and
the conditional uncertainty.

Let $\Theta$ be the component space, let $\theta_i \in \Theta$ and
$\theta_j \in \Theta$ be two components, and let $p(\theta_i)$ and
$p(\theta_j)$ be the probabilities of these components. The initial
uncertainty of $\theta_i$ is $H(\theta_i) = -p(\theta_i) \ln p(\theta_i)$. This
initial uncertainty is measured when $\theta_i$ is isolated. Let
$p(\theta_i \mid \theta_j)$ be the conditional probability of $\theta_i$ given
$\theta_j$. The conditional uncertainty of $\theta_i$ with respect to the
remaining components is

$$H(\theta_i \mid \Theta^{-i}) = -\sum_{j} p(\theta_j) \, p(\theta_i \mid \theta_j) \ln p(\theta_i \mid \theta_j),$$

where $\theta_j \in \Theta^{-i}$ and $\Theta^{-i} = \Theta - \{\theta_i\}$.
This conditional uncertainty measures the information of $\theta_i$ given that
the remaining components in $\Theta$ (denoted by $\Theta^{-i}$) exist in the
same system. The mutual information of $\theta_i$ with respect to the
remaining components is then the difference between the initial uncertainty
of $\theta_i$ and the conditional uncertainty of $\theta_i$,

$$I(\theta_i, \Theta^{-i}) = H(\theta_i) - H(\theta_i \mid \Theta^{-i}) = \sum_{j} i(\theta_i, \theta_j),$$

where $\theta_j \in \Theta^{-i}$ and
$i(\theta_i, \theta_j) = p(\theta_i, \theta_j) \ln \big( p(\theta_i, \theta_j) / (p(\theta_i) p(\theta_j)) \big)$.
The joint probability $p(\theta_i, \theta_j)$ can be computed by a geometrical
method: the space is divided into a set of hyper-cubes, and the probability
density is computed by dividing the number of components within a hyper-cube
by the volume of the hyper-cube [1]. In this paper, $p(\theta_i, \theta_j)$ is
calculated by applying Bayes' rule,
$p(\theta_i, \theta_j) = p(\theta_i \mid \theta_j) p(\theta_j)$ or
$p(\theta_i, \theta_j) = p(\theta_j \mid \theta_i) p(\theta_i)$. The system
mutual information is defined as

$$I(\Theta) = \sum_{i} p(\theta_i) \, I(\theta_i, \Theta^{-i}),$$

where $\theta_i \in \Theta$. The system mutual information indicates whether a
system has arrived at a stable point where all the components in the system
are mutually independent. The component mutual information
$I(\theta_i, \Theta^{-i})$ denotes whether $\theta_i$ has a significant and
independent contribution to the system probability density function. The
mutual relationship $i(\theta_i, \theta_j)$ measures the relationship between
a specific pair of components ($\theta_i$ and $\theta_j$), or the probability
that $\theta_i$ and $\theta_j$ exist within a system at the same time. If they
are mutually dependent, the mutual information between them will be large.
There are three possible cases for $i(\theta_i, \theta_j)$: negative, zero and
positive. A zero value means that the two components are mutually independent,
$p(\theta_i, \theta_j) = p(\theta_i) p(\theta_j)$. A negative value means that
they can be regarded as much less dependent, because
$p(\theta_i, \theta_j) < p(\theta_i) p(\theta_j)$. A positive value means that
they are mutually dependent, $p(\theta_i, \theta_j) > p(\theta_i) p(\theta_j)$.
Only in this last situation is one of the two components considered for
removal from the system.
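The three quantities defined above (pairwise, component and system mutual information) can be computed directly once the marginal and joint component probabilities are available. The sketch below is illustrative only: the function name `mutual_information` is hypothetical, the joint probabilities are taken as given (for example obtained via Bayes' rule), and all joint entries are assumed to be strictly positive.

```python
import numpy as np

def mutual_information(p, joint):
    """Pairwise, component and system mutual information.

    p:     (K,) component probabilities p(theta_i)
    joint: (K, K) joint probabilities, joint[i, j] = p(theta_i, theta_j);
           entries are assumed to be strictly positive.
    """
    p = np.asarray(p, dtype=float)
    joint = np.asarray(joint, dtype=float)
    # i(theta_i, theta_j) = p_ij * ln(p_ij / (p_i * p_j))
    pair = joint * np.log(joint / np.outer(p, p))
    np.fill_diagonal(pair, 0.0)       # the sum runs over j != i only
    component = pair.sum(axis=1)      # I(theta_i, Theta^{-i})
    system = float(p @ component)     # I(Theta)
    return pair, component, system
```

For mutually independent components, where the joint matrix equals the outer product of the marginals, all three quantities are zero, which matches the zero case discussed above.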
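4 The algorithm and examples

The algorithm for an adaptive mixture model using mutual information theory
may be stated as follows:

Step 0. Given $\Theta = \{\theta_k, k = 1, \ldots, m\}$ and the patterns
$\{x_i, i = 1, \ldots, n\}$.

Step 1. Maximize $L$.

Step 2. If $I(\Theta) \le 0$ and $I(\theta_i, \Theta^{-i}) \le 0$ for all
$\theta_i \in \Theta$, stop; otherwise go to step 3.

Step 3. Compute the system mutual information $I(\Theta)$, the component
mutual information set $A = \{I(\theta_i, \Theta^{-i})\}$ and the mutual
information pair set $B = \{i(\theta_i, \theta_j)\}$.

Step 4. $\Theta \leftarrow \Theta - \{\theta_T \text{ or } \theta_{T'}\}$, where
$I(\theta_T, \Theta^{-T}) > 0$ and
$I(\theta_T, \Theta^{-T}) = \max\{I(\theta_i, \Theta^{-i})\}$,
$\theta_i \in \Theta$, as well as
$i(\theta_T, \theta_{T'}) = \max\{i(\theta_T, \theta_i)\}$,
$i(\theta_T, \theta_{T'}) > 0$ and $\theta_T \ne \theta_i$. Go to step 1.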
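The stopping test of Step 2 and the selection rule of Step 4 can be sketched as a small helper. This is a simplified illustration under stated assumptions: the name `select_component_to_remove` is hypothetical, the mutual information values are taken as already computed, and the helper always picks the component with the largest positive component mutual information (the alternative of removing its most dependent partner instead is not modelled).

```python
def select_component_to_remove(system_mi, component_mi):
    """Return the 0-based index of the component to remove, or None to stop.

    system_mi:    system mutual information I(Theta)
    component_mi: list of component mutual information values
    """
    # Step 2: stop when nothing in the system is positively dependent.
    if system_mi <= 0 and all(v <= 0 for v in component_mi):
        return None
    # Step 4 (simplified): the component with the largest positive value.
    best = max(range(len(component_mi)), key=lambda i: component_mi[i])
    return best if component_mi[best] > 0 else None
```

Applied to the values reported for example 1, the helper selects the fifth component, then the third, and then signals a stop, which is the sequence described in the text.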



Table 1 The process of adapting a mixture model for example 1

no, I(Theta)   i   mu_i    sigma_i   omega_i   I(theta_i, Theta^-i)   max i(theta_i, theta_j)   j
5,  0.028      1   -0.11   0.74      0.11       0.037                 0.046                     5
               2    2.47   0.92      0.50      -0.059                 -                         -
               3    3.66   0.66      0.19      -0.005                 -                         -
               4    1.33   0.91      0.12       0.060                 0.054                     2
               5    0.29   0.82      0.08       0.106                 0.092                     1
4,  0.002      1   -0.15   0.70      0.16      -0.018                 -                         -
               2    3.16   0.83      0.50      -0.045                 -                         -
               3    1.98   0.89      0.14       0.091                 0.075                     4
               4    1.36   0.70      0.20      -0.018                 -                         -
3, -0.038      1    0.24   0.68      0.13      -0.020                 -                         -
               2    3.28   0.79      0.41      -0.035                 -                         -
               3    1.70   1.00      0.46      -0.021                 -                         -



The first example was a mixture of three Gaussian distributions:
$0.3N(0.5, 1.25^2) + 0.2N(2, 1.5^2) + 0.5N(3, 1^2)$. The system started with
five components and was stable when only three components were left. Table 1
gives the experimental results. The system mutual information was 0.028 with
five components. Three components had positive mutual information (0.037,
0.060 and 0.106). The first component and the fifth component were dependent
on each other. The fifth component was removed since it had the largest
positive mutual information. When four components were left, the system
mutual information was still positive (0.002) and the third component had
positive mutual information (0.091). Hence the third component was removed.
This removal led to a negative system mutual information (-0.038), and none of
the three remaining components had positive mutual information. The system was
then stable. Figure 1 shows the true probability density function with the
solid line and the approximated probability density function with the dotted
line.

The second example was also a mixture of three Gaussian distributions:
$0.3N(0.5, 1.1^2) + 0.2N(4.2, 1.2^2) + 0.5N(7.4, 1.3^2)$. Initially setting up
five components led to a system mutual information of 0.01, and the fifth
component had the largest positive mutual information (0.042). After removing
this component the system mutual information was negative (-0.001). However,
the first component still had positive mutual information (0.001). It was
therefore removed. When only three components were left in the system, the
system mutual information became negative (-0.016) and none of the components
had positive mutual information. The true probability density function (solid
line) and the approximated probability density function (dotted line) are
plotted in Figure 2.





Figure 1: True and the simulated pdfs for example 1

Figure 2: True and the simulated pdfs for example 2



Table 2 The process of adapting a mixture model for example 2

no, I(Theta)   i   mu_i    sigma_i   omega_i   I(theta_i, Theta^-i)   max i(theta_i, theta_j)   j
5,  0.010      1   -0.56   0.42      0.06       0.000                 -                         -
               2    7.57   1.01      0.42      -0.029                 -                         -
               3    4.81   1.42      0.30      -0.014                 -                         -
               4    1.18   0.66      0.13      -0.001                 -                         -
               5    0.33   0.41      0.09       0.042                 0.041                     4
4, -0.001      1   -0.19   0.59      0.11       0.001                 0.001                     4
               2    7.52   1.03      0.44      -0.022                 -                         -
               3    4.69   1.33      0.27      -0.013                 -                         -
               4    0.98   0.74      0.18      -0.012                 -                         -
3, -0.016      1    0.53   0.88      0.28      -0.004                 -                         -
               2    7.40   1.10      0.49      -0.029                 -                         -
               3    4.38   1.25      0.23      -0.014                 -                         -



Summary 

A novel algorithm for an adaptive mixture model using mutual information has
been presented in this paper. The experimental results indicate that applying
mutual information to adaptively select an optimal set of components is
robust, because mutual information captures the nonlinear dependence between
components. Removing the component with the largest mutual information leads
to a maximal reduction of the system's mutual information. Such components
are removed until the system mutual information approaches zero or becomes
negative. From this, an optimal set of components for estimating the
underlying probability density function is obtained.

Abstract. Company performance is commonly evaluated based on financial
ratios, which are derived from accounting figures contained in financial
statements. Understanding the relationship between company performance and
financial ratios is the key to selecting a suitable technique for company
performance evaluation. This paper demonstrates that if financial ratios do
not possess stable profiles, the relationship between company performance and
financial ratios is nonlinear, and the empirical results show that over half
of the financial ratios from the UK construction industry do not have stable
profiles.



1 Introduction 

Company performance evaluation is based on financial ratios, each of which is
the proportion of one (or more than one) accounting figure to another (or more
than one) accounting figure contained in financial statements. The statistical
distribution of financial ratios has been intensively studied [4], [6], [7],
[9] and will not be discussed in this paper. This paper analyses the
relationship between company performance and financial ratios for the purpose
of selecting a suitable technique for company performance evaluation.

It was widely assumed that the relationship between company performance and
financial ratios is linear [1], [2], [3], [5], [7], [11], but recently
Salchenberger et al. indicated that the relationship between company
performance and financial variables is nonlinear [9]. There have, however,
been few reports analysing the relationship between company performance and
financial ratios so far, although such analysis is the key to selecting a
suitable technique for company performance analysis. Since company
performance evaluation is commonly based on financial ratios derived from
financial statements in different years, the relative relationship in
financial ratios between failed and non-failed companies may change according
to economic change in different years. Whether financial ratios keep stable
profiles in different years is then open to question. For example, some
profitability ratios from non-failed companies are not always larger than
those from failed companies in different years. It is found that it is the
non-stable profile of financial ratios that makes the relationship between
company performance and financial ratios nonlinear. The empirical study shows
that on average 58 percent of ratios in the UK construction industry do not
have stable profiles, and hence the relationship between the performance of
UK construction companies and their financial ratios is complex or nonlinear.



2 Profile analysis 

If the mean value of a ratio (mean ratio) calculated from failed (non-failed)
companies is always larger than that calculated from non-failed (failed)
companies in different years, the ratio is termed as having consistent
differences between failed and non-failed companies [8]. Such a ratio has a
stable profile; otherwise it has a non-stable profile. The mean values
calculated from failed and non-failed companies are plotted year by year, so
whether a ratio possesses a stable profile can be visualised. In Figure 1, the
solid lines represent the mean ratios of non-failed companies; the dotted
lines are the mean ratios of failed companies. The ratios plotted in Figures 1
and 2 have stable profiles. On the other hand, the ratios shown in Figures 3
and 4 do not possess stable profiles.
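The definition of a stable profile translates into a simple check on the yearly mean ratios of the two groups. The sketch below is only an illustration: the function name `has_stable_profile` and the example values in the usage note are hypothetical, and a year in which the two means are exactly equal is treated as breaking stability.

```python
def has_stable_profile(failed_means, non_failed_means):
    """True if one group's mean ratio exceeds the other's in every year.

    failed_means, non_failed_means: yearly mean values of one ratio,
    listed in the same year order for both groups.
    """
    diffs = [f - nf for f, nf in zip(failed_means, non_failed_means)]
    # Stable: the sign of the difference never changes across the years.
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)
```

For instance, with made-up yearly means, `has_stable_profile([0.1, 0.2, 0.1], [0.4, 0.5, 0.3])` is true because the non-failed mean is larger in every year, whereas swapping the order in a single year makes the profile non-stable.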




Figure 1: Stable profiles 



Figure 2: Stable profiles 




Figure 3: Non-stable profiles

Figure 4: Non-stable profiles

The impact of the profile stability of financial ratios on the relationship
between company performance and financial ratios is analysed by combining
ratios to see whether a combination results in a consistent decision rule,
that is, a linear decision space. First, the situation shown in Figures 1 and
2 is considered, where ratio 1 and ratio 2 have stable profiles. If the
sampling years are fixed at A and B, four companies are selected as shown in
Figures 1 and 2, where F1 and F2 are two failed companies and NF1 and NF2 are
two non-failed companies. Each of them has two ratios; there are therefore
eight points in Figures 1 and 2, shown as squares and circles. For example,
the non-failed company at year A is marked as 'NF1' and has the value 4 for
ratio 1 and the value 2 for ratio 2. Combining these two ratios, a
two-dimensional data space is formed as shown in Figure 5. It can be seen that
the combination leads to a linear relationship between company performance
and financial ratios, because the decision space composed of these two ratios
is linearly divided into two halves, each of which represents one particular
group of companies, failed or non-failed. The closer a company lies to the
bottom-right corner, the better (stronger) its performance; the closer it
lies to the top-left corner, the worse (weaker) its performance. A linear
relationship is therefore satisfied. Hence a combination of two ratios
possessing stable profiles yields a linear relationship between company
performance and financial ratios.




Figure 5: Combination of Ratio 1 and 2

Figure 6: Combination of Ratio 3 and 4

Now consider the situation shown in Figures 3 and 4, where ratio 3 and ratio 4
do not possess stable profiles. If the sampling years are fixed at C and D,
another four companies named F3, F4, NF3 and NF4 are selected, where F3 and F4
are two failed companies and NF3 and NF4 are two non-failed companies. Each of
them has two ratios, shown as eight points in Figures 3 and 4. Ratio 3 and
ratio 4 can be combined into another two-dimensional data space, see Figure 6.
It can be seen that the combination of these two ratios results in a nonlinear
relationship between company performance and financial ratios, because there
is no linear decision space like that in Figure 5. Therefore whether financial
ratios have stable profiles determines whether the relationship between
company performance and financial ratios is linear or not.
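The contrast between the two combinations can be illustrated with a small perceptron test for linear separability. This is a sketch under stated assumptions: the function name `linearly_separable` is hypothetical, and apart from NF1 = (4, 2), which is quoted in the text, all coordinates in the usage note are made-up values chosen to mimic a stable and a non-stable situation. A `False` result only means that no separating line was found within `max_epochs`.

```python
def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron test: True if a separating straight line is found.

    points: list of (ratio_a, ratio_b) pairs
    labels: +1 for non-failed companies, -1 for failed companies
    """
    w1 = w2 = b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), y in zip(points, labels):
            # A point is misclassified if it is not strictly on its own side.
            if y * (w1 * x1 + w2 * x2 + b) <= 0:
                w1 += y * x1
                w2 += y * x2
                b += y
                errors += 1
        if errors == 0:
            return True
    return False
```

With hypothetical points NF1 = (4, 2), NF2 = (5, 1), F1 = (1, 4), F2 = (2, 5), the test succeeds, as in the stable case. With hypothetical points F3 = (1, 1), F4 = (4, 4), NF3 = (1, 4), NF4 = (4, 1), where the groups swap sides between the two years, no separating line exists and the test fails.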



3 The empirical result 

Financial statements of 2408 UK companies from 1989 to 1994 were collected in
this study. In this data set there are 2244 non-failed companies and 164
failed companies. Thirty-three ratios were calculated from this raw data, see
Table 1. Among these 33 ratios, there are nine liquidity ratios, nine assets
structure ratios, 12 profitability ratios and three gearing ratios. Liquidity
ratios are used to indicate the possibility of the short-term survival of a
company [6]. They show whether a company is able to meet its immediate
obligations and whether it has enough money to pay back its credits. Gearing
(or capital structure) ratios are used to indicate the extent to which a
company is expected to experience financial risk [4]. Profitability ratios
are used for measuring whether a company is able to earn an acceptable return
to continue its business by the contributions from income, capital, assets
and funds [6]. Asset structure (or activity) ratios usually help managers and
outsiders to judge how effectively a company manages its assets [4].

The analysis result is shown in Table 2. It can be seen that 19 ratios do not
have stable profiles: 58 percent of profitability ratios, 56 percent of
liquidity ratios, 67 percent of gearing ratios and 56 percent of assets
structure ratios do not possess stable profiles. Therefore the relationship
between company performance and financial ratios is nonlinear in this
particular industry.



Table 1 The 33 Key Ratios

RATIOS         CLASSIFICATION      DEFINITIONS
[illegible]    Liquidity           CA: current assets
DBT/EQT        Gearing             DBT: debt; EQT: equity
[illegible]    Assets structure    TU: turnover
[illegible]    Profitability       PAT: profit after taxes
[illegible]    Gearing             STL: short term loan
[illegible]    Liquidity           CL: current liabilities
[illegible]    Assets structure    FA: fixed assets
WC/TA          Assets structure    WC: working capital
[illegible]    Profitability       PBIT: profit before interest and taxes
[illegible]    Profitability       STK: stock and work in progress
[illegible]    Profitability       DEP: depreciation
[illegible]    Assets structure    TA: total assets
[illegible]    Assets structure    LTL: long term loan
[illegible]    Assets structure
PBIT/EQT       Profitability
[illegible]    Gearing
[illegible]    Profitability       PBT: profit before interest
[illegible]    Profitability
[illegible]    Liquidity
[illegible]    Profitability
[illegible]    Liquidity
[illegible]    Profitability       NA: net assets
[illegible]    Assets structure
[illegible]    Assets structure
[illegible]    Liquidity
[illegible]    Profitability
[illegible]    Liquidity           CRD: creditor
[illegible]    Liquidity
PBIT/CA        Profitability
[illegible]    [illegible]
[illegible]    [illegible]
[illegible]    Liquidity
PBT/avg CL     Profitability





Table 2 The analysis result of stable profiles

CLASSIFICATION      TOTAL   NON-STABLE   PERCENT
Profitability       12      7            58
Liquidity           9       5            56
Gearing             3       2            67
Assets structure    9       5            56
Total               33      19           58



Two ratios (PAT/TA and (PBIT+DEP)/TA) with stable profiles are plotted in
Figures 7 and 8, and two ratios (DBT/EQT and CA/NA) without stable profiles
are plotted in Figures 9 and 10. In Figure 3, the filled circles indicate the
mean values of the non-failed companies and the open circles denote the mean
values of the failed companies.

Figure 11 is a combination of two financial ratios (PAT/TA and (PBIT+DEP)/TA)
both having stable profiles. It can be seen that a straight line can be drawn
to separate failed and non-failed companies without any difficulty. Therefore
the relationship between company performance and financial ratios is linear
if only these two ratios are considered for decision making. Figure 12 is a
combination of one financial ratio (PAT/TA) with a stable profile and another
financial ratio (CA/NA) without a stable profile. It can be seen that this
combination still results in a linear relationship between company
performance and financial ratios if only these two ratios are considered for
decision making. Figure 13 shows a combination of two financial ratios
(DBT/EQT and CA/NA) both having non-stable profiles. As expected, this
combination leads to a nonlinear relationship between company performance and
financial ratios, because it is impossible to separate failed and non-failed
companies by one straight line if only these two ratios are considered. If
all 33 ratios are used for decision making (evaluation of UK construction
companies), it can be concluded that the decision space would be very complex
and that nonlinear techniques such as neural networks are needed.




Figure 7: A ratio with a stable profile

Figure 8: Another ratio with a stable profile

Figure 9: A ratio without a stable profile

Figure 10: Another ratio without a stable profile

Summary

This study indicates that the lack of profile stability of financial ratios is
the main cause of the complexity of the relationship between company
performance and financial ratios. The analysis of the data set collected from
UK companies gives evidence that the relationship between company performance
and financial ratios is indeed nonlinear. From this, it can be seen that
univariate analysis is unable to evaluate company performance when such
complexity is present. The non-stable profiles lead to inconsistent decision
rules between failed and non-failed companies. This causes nonlinearity.
Hence using nonlinear techniques such as neural networks for company
performance evaluation based on company financial ratios is expected to
produce better results. Further work will take the variances of the ratios
into account.

Abstract. In this paper we introduce a new recurrent network architecture
called ECNN, which includes the last model error measured as an additional
input. Hence, the learning can interpret the model's misfit as an external
shock which can be used to guide the model dynamics afterwards. As
extensions to the ECNN, we present a concept called overshooting, which
enforces the autoregressive part of the model, and we combine our approach
with a bottleneck coordinate transformation to handle high dimensional
problems (variants-invariants separation). Finally we apply the ECNN to the
German yield curve. Our model allows a forecast of ten different interest
rate maturities on forecast horizons between one and six months ahead. It
turns out that our approach is superior to more conventional forecasting
techniques.



1 Modeling Dynamic Systems by Error Correction 

If we have a complete description of all external forces $u_t$ influencing a
deterministic system $y_t$, Eq. 1 would allow us to identify temporal
relationships by setting up a memory in the form of a state transition
equation $s_t$. Unfortunately, our knowledge about the external forces $u_t$
is usually limited and the observations made are typically noisy. Under such
conditions learning with finite datasets leads to the construction of
incorrect causalities due to learning by heart (over-fitting). The
generalization properties of such a model are very questionable.

$$\begin{aligned} s_t &= f(s_{t-1}, u_t) \\ y_t &= g(s_t) \end{aligned} \tag{1}$$

If we are unable to identify the underlying dynamics of the system due to
unknown influences, we can refer to the observed model error at time period
$t-1$, which quantifies the misspecification of our model. Handling this error
flow as an additional input, we extend Eq. 1 obtaining Eq. 2, where $y_{t-1}$
denotes the output of the model and $y^d_{t-1}$ the related observation at
time period $t-1$.

$$\begin{aligned} s_t &= f(s_{t-1}, u_t, y_{t-1} - y^d_{t-1}) \\ y_t &= g(s_t) \end{aligned} \tag{2}$$

Keep in mind that if we had a perfect description of the dynamics, the
extension of Eq. 2 would no longer be required, since the model error at
$t-1$ would be equal to zero. In case of imperfect information, the model uses
its own error flow as a measurement of unexpected shocks. This is similar to
the MA part of a linear ARIMA model. Working with state space models, we can
skip the use of delayed error corrections.

1.1 Error Correction Neural Networks 

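A neural network implementation of the error correction equations (Eq. 2) can
be formulated as

$$\begin{aligned} s_t &= \tanh(A s_{t-1} + B u_t + D \tanh(C s_{t-1} - y^d_{t-1})) \\ y_t &= C s_t \end{aligned} \tag{3}$$

$$\frac{1}{T} \sum_{t=1}^{T} (y_t - y^d_t)^2 \to \min_{A,B,C,D} \tag{4}$$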

The neural network of Eq. 3, stated in an error correction form, measures the
deviation between the expected value $C s_{t-1}$ and the related observation
$y^d_{t-1}$. The term $C s_{t-1}$ recomputes the last output $y_{t-1}$ and
compares it to the observed data $y^d_{t-1}$. The transformation $D$ is
necessary in order to adjust different dimensionalities in the state
transition equation. While matrix $B$ introduces external information $u_t$
to the system, the model error is utilized by $D$. For numerical reasons, we
included the $\tanh(\cdot)$ nonlinearity. The system identification (Eq. 4) is
a parameter optimization task adjusting the weights of the matrices $A$, $B$,
$C$, $D$.
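A single state transition of Eq. 3 can be written out in NumPy as follows. This is a minimal sketch of the forward computation only, not of the training procedure or of the SENN implementation; the function name `ecnn_step` and the argument names are hypothetical.

```python
import numpy as np

def ecnn_step(s_prev, u_t, y_obs_prev, A, B, C, D):
    """One error correction state transition following Eq. 3.

    s_prev:     previous state s_{t-1}, shape (n_state,)
    u_t:        external inputs u_t, shape (n_in,)
    y_obs_prev: observed data y^d_{t-1}, shape (n_out,)
    A: (n_state, n_state), B: (n_state, n_in),
    C: (n_out, n_state),   D: (n_state, n_out)
    """
    # Deviation between the recomputed output C s_{t-1} and the observation.
    z = np.tanh(C @ s_prev - y_obs_prev)
    s_t = np.tanh(A @ s_prev + B @ u_t + D @ z)
    y_t = C @ s_t
    return s_t, y_t
```

When the previous output matches the observation exactly, the correction term vanishes and the update reduces to `tanh(A @ s_prev + B @ u_t)`, which mirrors the remark that the extension is not needed for a perfectly specified model.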

1.2 Unfolding in Time of Error Correction Neural Networks 

In a next step we want to translate the formal description of Eq. 3 into a
recurrent network architecture, which unfolds over several time steps using
shared weights, i.e. the weight values are the same at each time step of the
unfolding [5]. We call this architecture Error Correction Neural Network
(ECNN) (see Fig. 1).







Fig. 1. Error Correction Neural Network 



The ECNN architecture (Fig. 1) is best understood if one analyses the
dependency between $s_{t-1}$, $u_t$, $z_t = C s_{t-1} - y^d_{t-1}$ and $s_t$.

We have two input types: (i) the external inputs $u_t$ directly influencing
the state transition and (ii) the targets $y^d_t$. Only the difference between
the internal expected $y_t$ and the observation $y^d_t$ has an impact on the
state transition. Note that $-Id$ is a negative identity matrix frozen during
the learning.

At time period $t+1$, there is no compensation $y^d_{t+1}$ of the internal
expected value $y_{t+1}$, and thus the system offers a forecast. This design
also allows an elegant handling of missing values: if there is no compensation
of the internal expected value $y_t = C s_{t-1}$, the system automatically
creates a replacement $y^d_t$.

The output clusters of the ECNN which generate error signals during the
learning are the $z_{t-\tau}$. Keep in mind that the target values of the
sequence of output clusters $z_{t-\tau}$ are zero, because we want to optimize
the compensation mechanism between the expected value $y_{t-\tau}$ and its
observation $y^d_{t-\tau}$.



1.3 Overshooting in Error Correction Neural Networks



An obvious generalization of the ECNN in Fig. 1 is the extension of the
autonomous recurrence in future direction $t+1, t+2, \ldots$ (see Fig. 2). We
call this extension overshooting (see [5] for more details).




Fig. 2. Combining Overshooting and Error Correction Neural Networks 



Note that overshooting generates additional valuable forecast information
about the dynamical system and acts as a regularization method for the
learning. Furthermore, overshooting influences the learning of the ECNN in an
extended way: a forecast provided by the ECNN is based on a modeling of the
recursive structure of a dynamical system (coded in the matrix $A$) and the
error correction mechanism which is acting as an external input (coded in
$C$, $D$). Now, the overshooting enforces the autoregressive substructure,
allowing long-term forecasts. Of course, in the overshooting we have to
provide the additional output clusters $y_{t+1}, y_{t+2}, \ldots$ with target
values in order to generate error signals for the learning. Note that due to
shared weights overshooting has the same number of parameters as the basic
ECNN (Fig. 1).
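The autonomous part of overshooting can be sketched as iterating the state with the shared matrix A and reading out a forecast at every step. This is an illustrative simplification: the function name `overshoot` is hypothetical, and the future steps are assumed to receive neither external inputs nor error corrections, so each step is taken to be `tanh(A @ s)`.

```python
import numpy as np

def overshoot(s_t, A, C, horizon):
    """Iterate the autonomous recurrence and collect the forecasts.

    s_t: present state, shape (n_state,)
    A:   shared state transition matrix, shape (n_state, n_state)
    C:   shared output matrix, shape (n_out, n_state)
    Returns an array of shape (horizon, n_out) with y_{t+1}, y_{t+2}, ...
    """
    forecasts = []
    s = s_t
    for _ in range(horizon):
        s = np.tanh(A @ s)         # autonomous step, same A at every step
        forecasts.append(C @ s)    # output cluster for this future step
    return np.stack(forecasts)
```

Because the same A and C are reused at every step, lengthening the horizon adds output clusters but no new parameters, in line with the remark about shared weights.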
2 Variants-Invariants Separation combined with ECNN

In this section we want to integrate the dimension reduction concept of coor- 
dinate transformation (so called variants-invariants separation) into the ECNN 
(Fig. 3) in order to model high dimensional dynamical systems (see [5]). 

The separation of variants-invariants can be realized by a bottleneck neu- 
ral network (left hand-side of Fig. 3). The compressor F separates into variants 
and invariants, while the decompression is done by matrix E reconstructing the 
complete dynamics. 

Combining the latter concept with ECNN (Fig. 3), the compressor / de- 
compressor network seems to be disconnected from the ECNN, however this 
isn’t true: Since we use shared weights the two subsystems influence each other 
without having an explicit interconnection. 




Fig. 3. Combining Variance - Invariance Separation and Forecasting 



Thus, the ECNN has to predict a coordinate transformed low dimensional vector
$x_t$ instead of the high dimensional vector $y_t$. Note that the ECNN
requires $-y^d_t$ as inputs in order to generate $-x^d_t$ in the $z_t$ layer.
This allows the compensation of the internal forecasts $x_t = C s_{t-1}$ by
the transformed target data $-x^d_t = F(-y^d_t)$.

3 Application: Yield Curve Forecasting by ECNN 

Now, we apply the ECNN combined with the separation of variants-invariants
(see Fig. 3) to the German bond market to forecast the complete yield curve
(REX1 - REX10).

Our empirical study can be characterized as follows: We are working on the
basis of monthly data from Jan. 1975 to Aug. 1997 [266 data points] to
forecast monthly, quarterly and semi-annual changes of the German yield
curve, i.e. predicting interest rate shifts of REX1 to REX10 for these
forecast horizons simultaneously. The data is divided into two subsets: the
training set covers the time period from Jan. 91 to Aug. 95 [234 data
points], while the generalization set covers the time from Oct. 95 to Dec. 97
[32 data points].

Since we are interested in forecasting 1, 3 and 6 month changes of the German
yield curve, the applied ECNN uses 6 months of background information to
model the present time state and includes an overshooting environment of six
months, i.e. there are six autonomous iterations of the dynamical system. In
our experiments we found that for the German yield curve only 3 variants are
important. Furthermore, composing the internal state of the ECNN of 8 neurons
allows a fairly good description of the underlying dynamics. We trained the
ECNN until convergence with the vario-eta learning algorithm using a small
batch size of 15 patterns (see [2]).

As a further decision, we composed a data set of external factors in order to
forecast the German yield curve. Considering economic trends, inflation,
stock markets and FX-rates of Germany, USA and Japan, we obtained 9 economic
indicators. Note that the preprocessing of the input data is basically done
by calculating the scaled momentum of the time series (see [2]).

In order to measure the performance of our ECNN, we compare its forecasts to
those of two benchmarks: The first one refers to a naive strategy which
assumes that an observed trend of the interest rate development will last for
more than one time period. The second one is a 3-layer MLP with one input
layer, a hidden layer consisting of 20 neurons with $\tanh(\cdot)$ activation
function and one output layer with a squared cost function, simultaneously
predicting the complete German yield curve for the 3 forecast horizons. The
input signals, preprocessing, time scheme and data sets correspond to the
dispositions described above. We trained the network until convergence with
pattern-by-pattern learning using a small batch size of 1 pattern (see [2]).

The performance of the models is measured by their realized potential, which
is defined as the ratio of the accumulated model return to the maximum
possible accumulated return. The accumulated return refers to a simple
trading strategy using the particular yield curve forecasts, e.g. we sell
bonds if we expect rising interest rates. Proceeding this way, we would expand
our net capital balance in case of a higher interest rate level using the
price shifts of the bonds.
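The realized potential can be illustrated with a small function. The exact trading rule is not spelled out in the text, so the following is an assumption-laden sketch: the name `realized_potential` is hypothetical, the position in each period is taken to be the sign of the forecast change, the period return is the position times the actual change, and the maximum possible return is the sum of the absolute actual changes.

```python
def realized_potential(forecast_changes, actual_changes):
    """Ratio of accumulated model return to maximum possible return.

    forecast_changes: forecast interest rate changes, one per period
    actual_changes:   realized interest rate changes, same periods
    """
    model_return = 0.0
    max_return = 0.0
    for f, a in zip(forecast_changes, actual_changes):
        position = (f > 0) - (f < 0)   # +1, 0 or -1 from the forecast sign
        model_return += position * a   # gain if the direction was right
        max_return += abs(a)           # return of a perfect forecast
    if max_return == 0:
        raise ValueError("actual changes are all zero")
    return model_return / max_return
```

Under these assumptions a forecast that always gets the direction right yields 1.0, one that is always wrong yields -1.0, and values in between measure the share of the attainable return that the strategy captured.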

The empirical results are summarized in Fig. 4. It turns out that the ECNN
combined with variants-invariants separation is superior to both benchmarks.
This is especially true for longer forecast horizons, e.g. 6 months ahead.
Comparing the two benchmarks, it becomes obvious that the naive strategy
achieves better results forecasting short-term maturities (REX 1 to REX 5),
while the 3-layer MLP dominates in forecasting long-term maturities (REX 6 to
REX 10). The latter observation, which is true for every forecast horizon
investigated, can be seen as the consequence of the short-term trend behavior
of the interest rates. Interestingly, the ECNN reaches a steady forecasting
quality over the complete yield curve, i.e. there is no drawback in
forecasting long-term instead of short-term maturities. This is due to the
variants-invariants separation. The ECNN realizes nearly 60% of the maximum
reachable performance, considering monthly and quarterly yield curve
forecasts. For the long-term forecast horizon, it turns out that the ECNN is
able to achieve up to 80% of the potential accumulated return. Note that both
benchmarks are far behind these results.

[Figure 4 panels: Generalization Set, 1 Month Forecasts; Generalization Set,
6 Months Forecasts]

Fig. 4. Realized potential of each trading strategy.



4 Conclusion 

We introduced an ECNN architecture, which is able to handle external shocks 
without changing the identified dynamics of the underlying system. The ECNN 
can be extended by overshooting as well as a variants-invariants separation, which 
allows the modeling of high dimensional, noisy dynamical systems. The empirical 
results indicate that the performance of the ECNN is superior to more 
conventional forecasting techniques. 

The described algorithms are integrated in the Simulation Environment for 
Neural Networks, SENN, a product of Siemens AG. 
Abstract. This paper deals with the application of saliency analysis to Support 
Vector Machines (SVMs) for feature selection. The importance of each feature is 
ranked by evaluating the sensitivity of the network output to the feature input in 
terms of the partial derivative. A systematic approach to remove irrelevant 
features based on the sensitivity is developed. Five futures contracts are 
examined in the experiment. The simulation results show that saliency analysis 
is effective in SVMs for identifying important features. 



1 Introduction 

In recent years, support vector machines (SVMs) have been receiving 
increasing attention in the regression estimation area due to their remarkable 
characteristics such as good generalization performance, the absence of local minima 
and sparse representation of the solution [1,2]. However, within the SVMs framework, 
there are very few established approaches for identifying important features. The 
issue of feature selection for SVMs was recently discussed in [3]. There it is 
stated that feature selection should be performed in SVMs when many features exist, 
as this procedure can improve the network performance, speed up the training and 
reduce the complexity of the network. 

This paper applies saliency analysis (SA) to SVMs for selecting important 
features. SA measures the importance of features by evaluating the sensitivity of 
the network output with respect to the weights (weight-based SA) or the feature inputs 
(derivative-based SA). Based on the idea that important features usually have large 
absolute values of connected weights and unimportant features have small absolute 
values of connected weights, the weight-based saliency analysis detects irrelevant 
weights by evaluating the magnitude of the weights, and then removes the features 
from which these irrelevant weights emanate [4]. This method has also been extended 
to other types of weight pruning that use a penalty term in the cost function to 
remove irrelevant features [5]. The derivative-based SA measures the importance of 
features by evaluating the sensitivity of the network output with respect to the 
feature inputs based on the partial derivative [6]. For irrelevant features, which 
provide little information on the prediction, the saliency metric takes a small 
value, which indicates that the network output is insensitive to those features. 
Conversely, for significant features, which contribute much to the prediction, the 
saliency metric takes a large value. As the weights of SVMs lie in a high 
dimensional feature space, and the magnitude of the weights is a measure of the 
importance of the high dimensional feature inputs rather than the original feature 
inputs, in this paper only the derivative-based SA is developed for SVMs. 

This paper is organized as follows. In section 2, we briefly introduce the theory of 
SVMs in the regression estimation. In section 3, the method of saliency analysis is 
described. Section 4 gives the experimental results, followed by the conclusions in the 
last section. 



2 Theory of SVMs for Regression Estimation 



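Given a set of data points $G = \{(X_i, d_i)\}_{i=1}^{N}$ ($X_i$ is the input vector, 
$d_i$ is the desired value, and $N$ is the total number of data patterns), SVMs 
approximate the function 

$$y = f(X) = \sum_{i} w_i \phi_i(X) + b \qquad (1)$$

where $\{\phi_i(X)\}$ are the high dimensional feature spaces which are nonlinearly 
mapped from the input space $X$. The coefficients $W = \{w_i\}$ and $b$ are estimated 
by minimizing 

$$R_{SVMs}(C) = C \frac{1}{N} \sum_{i=1}^{N} L_\varepsilon(d_i, y_i) + \frac{1}{2}\|W\|^2 \qquad (2)$$

$$L_\varepsilon(d, y) = \begin{cases} |d - y| - \varepsilon & \text{if } |d - y| \ge \varepsilon \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

The first term $C \frac{1}{N} \sum_{i=1}^{N} L_\varepsilon(d_i, y_i)$ is the empirical 
error, which is measured by the $\varepsilon$-insensitive loss function (3). The second 
term $\frac{1}{2}\|W\|^2$ is the regularization term. $C$ and $\varepsilon$ are 
referred to as the regularized constant and the tube size. 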
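
The two terms of the regularized risk can be sketched numerically as follows. This is a minimal illustration assuming the standard form of the ε-insensitive loss (zero inside the tube, linear outside); the function names are hypothetical and not taken from the paper.

```python
import numpy as np

def eps_insensitive_loss(d, y, eps):
    """Loss (3): zero inside the tube of half-width eps, linear outside."""
    d = np.asarray(d, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.maximum(np.abs(d - y) - eps, 0.0)

def regularized_risk(w, d, y, C, eps):
    """Risk (2): C times the mean loss plus half the squared weight norm."""
    w = np.asarray(w, dtype=float)
    empirical = C * np.mean(eps_insensitive_loss(d, y, eps))
    return empirical + 0.5 * np.dot(w, w)
```

Errors smaller than ε contribute nothing to the empirical term, which is why a wider tube yields fewer support vectors.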

To estimate $W$ and $b$, equation (2) is transformed to the primal function (4) by 
introducing the positive slack variables $\xi_i$ and $\xi_i^*$: 

Minimize: 

$$R_{SVMs}(W, \xi^{(*)}) = \frac{1}{2}\|W\|^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) \qquad (4)$$

Subject to: 

$$d_i - W\phi(X_i) - b \le \varepsilon + \xi_i, \qquad W\phi(X_i) + b - d_i \le \varepsilon + \xi_i^*, \qquad \xi_i^{(*)} \ge 0$$

Finally, by introducing Lagrange multipliers and exploiting the optimality 
constraints, the decision function (1) has the following explicit form [7]: 

$$f(X, a_i, a_i^*) = \sum_{i=1}^{N} (a_i - a_i^*) K(X, X_i) + b \qquad (5)$$

In function (5), $a_i$ and $a_i^*$ are the so-called Lagrange multipliers, and 
$K(X_i, X_j)$ is defined as the kernel function. Any function that satisfies Mercer's 
condition [7] can be used as the kernel function. 

Based on the Karush-Kuhn-Tucker (KKT) conditions of quadratic programming, 
only some of the coefficients $(a_i - a_i^*)$ will be nonzero. The data points 
associated with them have approximation errors equal to or larger than $\varepsilon$ 
and are referred to as support vectors. According to (5), it is evident that support 
vectors are the only data points that are used in determining the decision function. 

3 Saliency Analysis of SVMs 



As illustrated in (5), the decision function in SVMs is expressed as: 

$$f(X) = \sum_{j=1}^{N_s} (a_j - a_j^*) K(X_j, X) + b$$

where $N_s$ is the number of support vectors. So the network output depends on the 
converged Lagrange multipliers $(a_j - a_j^*)$ and the kernel $K(X_j, X)$ used. 

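The sensitivity of the network output to the input is approximated by the 
derivative: 

$$\frac{\partial f}{\partial x_{ki}} = \sum_{j=1}^{N_s} (a_j - a_j^*) \frac{\partial K(X_j, X_i)}{\partial x_{ki}} \qquad (6)$$

where $x_{ki}$ denotes the $k$-th feature of the $i$-th input pattern $X_i$. 

[The kernel-specific cases (i) to (iii) of this equation are not recoverable from the source.] 

The derivative of the output to the input can be calculated for any type of kernel 
function according to (6), and the value depends on the input feature $x_{ki}$, the 
support vectors $X_j$ as well as the converged Lagrange multipliers $(a_j - a_j^*)$. 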

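Then, the saliency metric of each feature is calculated as the absolute average of 
the derivative of the output to the input over the entire training data sample, which is: 

$$s_k = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{\partial f}{\partial x_{ki}} \right| \qquad (7)$$

Other calculations could be $s_k = \max_{i=1,\dots,N} \left\{ \left| \frac{\partial f}{\partial x_{ki}} \right| \right\}$ [8]. 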
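
For a concrete case, the sketch below evaluates the derivative and the saliency metric for a Gaussian kernel. It assumes the kernel form K(X_j, X) = exp(-||X_j - X||^2 / (2 sigma2)) and takes the saliency as the mean absolute derivative over the samples; the kernel parameterization, the array names (`sv`, `coef`) and the function names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(sv, x, sigma2):
    """K(X_j, x) for every support vector X_j (rows of sv)."""
    diff = sv - x
    return np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * sigma2))

def svr_output(x, sv, coef, b, sigma2):
    """Decision function: sum_j coef_j * K(X_j, x) + b, coef_j = a_j - a_j*."""
    return float(coef @ gaussian_kernel(sv, x, sigma2) + b)

def svr_gradient(x, sv, coef, sigma2):
    """Partial derivatives of the output with respect to each input feature.

    For the Gaussian kernel, dK/dx_k = K * (X_jk - x_k) / sigma2.
    """
    k = gaussian_kernel(sv, x, sigma2)
    return (coef * k) @ (sv - x) / sigma2

def saliency(X, sv, coef, sigma2):
    """Mean absolute derivative per feature over all rows of X."""
    grads = np.array([svr_gradient(x, sv, coef, sigma2) for x in X])
    return np.mean(np.abs(grads), axis=0)
```

The bias b drops out of the derivative, so the saliency depends only on the support vectors, their coefficients and the kernel parameter.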



After calculating the saliency value, a criterion needs to be set up to determine 
how many features could be removed from the whole feature set. This paper uses a 
simple threshold method. The features with saliency value lower than the threshold 
are deleted while those with saliency value larger than the threshold are retained, as a 
small value of saliency in comparison with others means that the corresponding input 
does not significantly contribute to the network output and therefore could be 
disregarded from the overall feature set. A systematic procedure for eliminating 
insignificant features is outlined as follows. 

1. Train SVMs using the full feature set. 

2. Calculate $s_k$ by (6) and (7) for each candidate feature. 

3. Rank $s_k$ in descending order as $s^1 \ge s^2 \ge \dots \ge s^K$, where 
$s^1 = \max\{s_k\}$, $s^2 = \max_{s_k \ne s^1}\{s_k\}$, and so on. 

4. Choose a proper threshold $\varepsilon$. 

5. If $s^j > \varepsilon$ and $s^{j+1} < \varepsilon$, delete the features corresponding 
to the saliency values $s^{j+1}, \dots, s^K$. 
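
The ranking and thresholding steps can be sketched as follows. The function name and the choice to return feature indices are assumptions for this example; features whose saliency equals the threshold are treated as removable here, a case the procedure does not specify.

```python
import numpy as np

def select_features(saliency_values, threshold):
    """Rank features by saliency (descending) and split them at the threshold."""
    s = np.asarray(saliency_values, dtype=float)
    order = np.argsort(-s)  # feature indices, largest saliency first
    kept = [int(i) for i in order if s[i] > threshold]
    removed = [int(i) for i in order if s[i] <= threshold]
    return kept, removed
```

After this step the SVM would be retrained on the kept features only, as done in the experiments.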



4 Experiment Results 

Five real futures contracts collated from the Chicago Mercantile Market are 
examined in the second series of experiments. They are the Standard & Poor 500 stock 
index futures (CME-SP), United States 30-year government bond (CBOT-US), United 
States 10-year government bond (CBOT-BO), German 10-year government bond 
(EUREX-BUND) and French government stock index futures (MATIF-CAC40). The 
daily closing prices are used as the data set. The original closing price is 
transformed into a five-day relative difference in percentage of price (RDP). 

The input variables are constructed from 3 lagged transformed closing prices, 
which are obtained by subtracting a 15-day exponential moving average from the 
closing price ($X_1$, $X_2$, $X_3$), and 14 lagged RDP values based on 5-day periods 
($X_4, \dots, X_{17}$). These indicators are taken from Thomason [9], but a larger 
number of indicators, which are believed to involve redundant information, is used 
here for the purpose of feature selection. The output variable RDP+5 is obtained by 
first smoothing the closing price with a 3-day exponential moving average, as the 
application of a smoothing transform to the dependent variable generally enhances the 
prediction performance of neural networks. Each data set is partitioned into three parts 
according to the time sequence. The first part is for training, the second part is for 
validation, which is used to select the optimal kernel parameter for SVMs, and the last 
part is for testing. In all the data sets there are a total of 907 data patterns in the 
training set and 200 data patterns in each of the validation set and the test set. 
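
The two transforms used to build the inputs can be sketched as below. The exact formulas are not given in the text, so this example assumes the common definitions: RDP as 100 * (p(t) - p(t-5)) / p(t-5), and an exponential moving average with smoothing factor 2 / (span + 1) initialized at the first price. Both function names are hypothetical.

```python
import numpy as np

def ema(prices, span):
    """Exponential moving average (assumed smoothing factor 2 / (span + 1))."""
    prices = np.asarray(prices, dtype=float)
    alpha = 2.0 / (span + 1.0)
    out = np.empty(len(prices))
    out[0] = prices[0]
    for t in range(1, len(prices)):
        out[t] = alpha * prices[t] + (1.0 - alpha) * out[t - 1]
    return out

def rdp(prices, lag=5):
    """Relative difference in percentage of price over `lag` days."""
    prices = np.asarray(prices, dtype=float)
    return 100.0 * (prices[lag:] - prices[:-lag]) / prices[:-lag]
```

Under these assumptions, the transformed closing price would be `prices - ema(prices, 15)`, and the target would be an RDP computed on `ema(prices, 3)` shifted five days ahead.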

The Gaussian function is used as the kernel function of SVMs. The kernel 
parameter, $C$ and $\varepsilon$ are selected based on the smallest normalized mean 
squared error (NMSE) on the validation set. The NMSE is calculated as 

$$NMSE = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} (d_i - y_i)^2, \qquad \sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} (d_i - \bar{d})^2 \qquad (8)$$

where $\bar{d}$ denotes the mean of $d_i$. The values of $C$ and $\varepsilon$ vary 
slightly among the futures due to their different market behaviors. The Sequential 
Minimal Optimization algorithm [10] is implemented and the program is developed in 
the VC++ language. 
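
The NMSE criterion can be sketched as follows, assuming the usual definition in which the squared error is normalized by the sample variance of the desired values (with an N - 1 denominator). The function name is hypothetical.

```python
import numpy as np

def nmse(d, y):
    """Normalized mean squared error of predictions y against targets d."""
    d = np.asarray(d, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(d)
    sigma2 = np.sum((d - d.mean()) ** 2) / (n - 1)
    return float(np.sum((d - y) ** 2) / (sigma2 * n))
```

Under this definition, always predicting the mean of the targets gives an NMSE of (N - 1) / N, so values close to or above 1 indicate little improvement over a constant forecast.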

The selected features are reported in Table 1. As it is unknown whether the 
irrelevant features have been correctly deleted, the selected features are used as the 
inputs of SVMs to retrain the network. The NMSE on the test set for the full feature 
set and the selected feature set is given in Table 2. It is evident that the NMSE 
on the test set is smaller for the selected feature set than for the full feature set. 
The result is consistent in all of the five contracts. This indicates that saliency analysis is 
effective in selecting important features for SVMs. 



5 Conclusions 



The saliency analysis for ranking the importance of features has been developed 
for SVMs by evaluating the partial derivative of the output to the feature input over 
the entire training data samples. The threshold method is applied to delete irrelevant 
features from the whole feature set. According to the simulation results by using five 
real futures contracts, it can be concluded that saliency analysis is effective in SVMs 
for selecting important features. By deleting the irrelevant features, the generalization 
performance of SVMs is greatly enhanced. 

There are still some aspects that require further investigation. Although using a 
simple threshold method for determining how many unimportant features are deleted 
works well in this study, more formal methods need to be explored for complex 
problems. 
Table 1. The selected features in the five futures contracts. 

Futures          Selected features 
CME-SP           X ,x , X 7 ,X ,2 
CBOT-US          X, , X 2 , X , Xq , X 12 Xj 
CBOT-BO          , X 2 ; X ; Xg Xg , Xq , ^12 
EUREX-BUND       X, , X 2 , X , Xj , X 12 Xj 
MATIF-CAC40      X] , X 2 , X , X, , Xj 2 X 7 



Table 2. The NMSE on the test set for the full feature set and the selected features. 

Futures          Full features    Selected features 
CME-SP           0.9629           0.8442 
CBOT-US          1.1643           1.0501 
CBOT-BO          1.1853           0.9936 
EUREX-BUND       1.4762           1.0792 
MATIF-CAC40      1.1582           0.9584 
Abstract. This paper proposes ε-descending support vector machines 
(ε-DSVMs) to model non-stationary financial time series. The ε-DSVMs are 
obtained by taking into account the problem domain knowledge of 
non-stationarity in the financial time series. Unlike the original SVMs, which use the 
same tube size for all the training data points, the ε-DSVMs use a tube whose 
size decreases from the distant training data points to the recent training data 
points. Three real futures contracts collected from the Chicago Mercantile 
Market are examined in the experiment, and it is shown that the ε-DSVMs 
consistently forecast better than the original SVMs. 



1 Introduction 

Financial time series are inherently noisy and non-stationary [1, 2]. The 
non-stationary characteristic means that the distribution of financial time series changes 
over time. This leads to gradual changes in the dependency between input and 
output variables. In the modeling of financial time series, the learning algorithm used 
should take this characteristic into account. Usually, the information provided by the 
recent data is weighted more heavily than that of the distant data [3]. 

Recently, the support vector machine (SVM), developed by Vapnik and his 
co-workers in 1995 [4] as a novel neural network technique, has received increasing 
attention in the area of regression estimation [5,6] due to its remarkable generalization 
performance. SVMs implement the Structural Risk Minimization principle, which 
seeks to minimize an upper bound of the generalization error rather than minimize the 
empirical error as commonly implemented in other neural networks. Another key 
property of SVMs is that training SVMs is equivalent to solving a linearly constrained 
quadratic programming problem. Consequently, the solution depends only 
on a small subset of training data points called support vectors. Using only the support 
vectors, the same decision function is obtained as when using all the 
training data points. 

What are support vectors? In regression estimation, they are the training data 
points whose approximation errors are equal to or larger than ε, the so-called 
tube size. That is, they are the data points lying on or outside the ε-bound of 
the decision function. Usually, the number of support vectors decreases as ε 
increases. In the case of a wide tube with few support vectors, the decision 
function can be represented very sparsely. However, too wide a tube will also 
reduce the estimation accuracy, as ε is equivalent to the approximation accuracy 
placed on the training data points. In the standard SVMs, ε is a constant value 
that is selected empirically. 

In this paper, the authors propose ε-descending SVMs (ε-DSVMs) to model 
financial time series by associating the relationship between support vectors and ε 
with the non-stationary characteristic of the financial time series. The ε-DSVMs use 
a tube whose size decreases from distant training data to recent training data to 
deal with the structural changes of the financial time series. There are two reasons for 
this modification. Firstly, since the number of support vectors is a decreasing function 
of ε, with a smaller ε the recent data have a greater probability of becoming 
support vectors and thus receive more attention in the representation 
of the decision function than the distant data. Secondly, from the approximation 
accuracy point of view, the recent data will be approximated more accurately than the 
distant data. This is desirable given the non-stationarity of the financial time 
series. The proposed method is illustrated experimentally using three real futures 
contracts. The experiment shows a consistent improvement from the use of ε-DSVMs. 

This paper is organized as follows. Section 2 gives a brief introduction to SVMs 
in regression estimation. Section 3 presents the ε-DSVMs. Section 4 gives the 
experimental results together with the data preprocessing technique. Section 5 
concludes the work. 

2 Theory of SVMs for Regression Estimation 



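Given a set of data points $G = \{(x_i, d_i)\}_{i=1}^{n}$ ($x_i$ is the input vector, 
$d_i$ is the desired value, and $n$ is the number of training data points), SVMs 
approximate the function using the following form: 

$$y = f(x) = \sum_{i} w_i \phi_i(x) + b \qquad (1)$$

where $\{\phi_i(x)\}$ are the high dimensional feature spaces which are nonlinearly 
mapped from the input space $x$. The coefficients $w = \{w_i\}$ and $b$ are estimated 
by minimizing the regularized risk function (2). 

$$R_{SVMs}(C) = C \frac{1}{n} \sum_{i=1}^{n} L_\varepsilon(d_i, y_i) + \frac{1}{2}\|w\|^2 \qquad (2)$$

$$L_\varepsilon(d, y) = \begin{cases} |d - y| - \varepsilon & \text{if } |d - y| \ge \varepsilon \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

The first term $C \frac{1}{n} \sum_{i=1}^{n} L_\varepsilon(d_i, y_i)$ is the empirical 
error, which is measured by the $\varepsilon$-insensitive loss function (3). The second 
term $\frac{1}{2}\|w\|^2$ is the regularization term. $C$ and $\varepsilon$ are 
referred to as the regularized constant and the tube size. 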
To estimate $\{w_i\}$ and $b$, equation (2) is transformed to the primal function (4) 
by introducing the positive slack variables $\xi_i$ and $\xi_i^*$. 

Minimize: 

$$R_{SVMs}(w, \xi^{(*)}) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \qquad (4)$$

Subject to: 

$$d_i - w\phi(x_i) - b \le \varepsilon + \xi_i, \qquad w\phi(x_i) + b - d_i \le \varepsilon + \xi_i^*, \qquad \xi_i^{(*)} \ge 0$$

Finally, by introducing Lagrange multipliers and exploiting the optimality 
constraints, the decision function (1) has the following explicit form [4]: 

$$f(x, a_i, a_i^*) = \sum_{i=1}^{n} (a_i - a_i^*) K(x, x_i) + b \qquad (5)$$

In function (5), $a_i$ and $a_i^*$ are the so-called Lagrange multipliers. They satisfy 
$a_i \cdot a_i^* = 0$, $a_i \ge 0$ and $a_i^* \ge 0$ where $i = 1, 2, \dots, n$, and they 
are obtained by maximizing the dual function of (4), which has the following form: 

$$R(a_i, a_i^*) = \sum_{i=1}^{n} d_i (a_i - a_i^*) - \varepsilon \sum_{i=1}^{n} (a_i + a_i^*) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (a_i - a_i^*)(a_j - a_j^*) K(x_i, x_j) \qquad (6)$$

with the following constraints: 

$$\sum_{i=1}^{n} (a_i - a_i^*) = 0, \qquad 0 \le a_i \le C, \qquad 0 \le a_i^* \le C, \qquad i = 1, 2, \dots, n$$

$K(x_i, x_j)$ is defined as the kernel function. Any function that satisfies Mercer's 
condition [4] can be used as the kernel function. Based on the Karush-Kuhn-Tucker 
(KKT) conditions of quadratic programming, only some of the coefficients 
$(a_i - a_i^*)$ will be nonzero. The data points associated with them have approximation 
errors equal to or larger than $\varepsilon$ and are referred to as support vectors. 
According to (5), it is evident that support vectors are the only data points that are 
used in determining the decision function. 

3 ε-Descending Support Vector Machines (ε-DSVMs) 

In ε-DSVMs, instead of taking a constant value, the tube size $\varepsilon_i$ follows 
the exponential function 

$$\varepsilon_i = \varepsilon_0 \, \frac{1 + \exp(a - 2a \cdot i / n)}{2} \qquad (7)$$

where $n$ is the total number of training data patterns, with $i = n$ being the most recent 
observation and $i = 1$ being the earliest observation, and $a$ is the parameter that 
controls the descending rate. The ideas of ε-DSVMs are, firstly, to give the recent data 
a higher chance of becoming support vectors and, secondly, to place higher approximation 
accuracy on the recent data than on the distant data. More attention is thereby paid to 
the recent information than to the distant information, since the recent information is 
more important in non-stationary financial time series. 
The behavior of the weight function $\varepsilon_i / \varepsilon_0$ can be summarized as 
follows (some examples are illustrated in Fig. 1): 

(i) When $a \to 0$, then $\varepsilon_i = \varepsilon_0$. In this case, the weights in all 
the training data points are equal to 1.0. 

(ii) When $a \to \infty$, then $\varepsilon_i \to \infty$ for $i < n/2$ and 
$\varepsilon_i \to \varepsilon_0 / 2$ for $i > n/2$. In this case, the weights for the first 
half of the training data points are increased to an infinite value while the 
weights for the second half of the training data points are equal to 0.5. 

(iii) When $a \in [0, \infty]$ and $a$ increases, the weights for the first half of the 
training data points become larger while the weights for the second half of the 
training data points become smaller. 
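
The limiting cases above can be checked numerically with a short sketch. It assumes the tube-size function has the form eps_i = eps0 * (1 + exp(a - 2 * a * i / n)) / 2 with i = 1 as the earliest and i = n as the most recent pattern; the function name is hypothetical.

```python
import numpy as np

def descending_tube(n, eps0, a):
    """Tube size for each training pattern i = 1..n (i = n is the most recent)."""
    i = np.arange(1, n + 1)
    return eps0 * (1.0 + np.exp(a - 2.0 * a * i / n)) / 2.0

# Example: with a = 0 every pattern gets eps0; with a > 0 the tube shrinks
# monotonically from the earliest pattern to the most recent one.
tubes = descending_tube(n=10, eps0=0.01, a=5.0)
```

At the midpoint i = n / 2 the exponent vanishes, so the tube size equals eps0 for any value of a.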




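Fig. 1. Weight function of ε-DSVMs. 

In ε-DSVMs, the regularized risk function has the original form but the 
constraints are changed according to (8), whereby every training data point 
corresponds to a different tube size $\varepsilon_i$. 

Minimize: 

$$R_{SVMs}(w, \xi^{(*)}) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \qquad (8)$$

Subject to: 

$$d_i - w\phi(x_i) - b \le \varepsilon_i + \xi_i, \qquad w\phi(x_i) + b - d_i \le \varepsilon_i + \xi_i^*$$

Thus, the dual function becomes (9) with the original constraints. 

$$R(a_i, a_i^*) = \sum_{i=1}^{n} d_i (a_i - a_i^*) - \sum_{i=1}^{n} \varepsilon_i (a_i + a_i^*) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (a_i - a_i^*)(a_j - a_j^*) K(x_i, x_j) \qquad (9)$$

Constraints: 

$$\sum_{i=1}^{n} (a_i - a_i^*) = 0, \qquad 0 \le a_i \le C, \qquad 0 \le a_i^* \le C, \qquad i = 1, 2, \dots, n$$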
4 Experiment Results 

Three real futures contracts collected from the Chicago Mercantile Market are 
examined in the experiment. They are the Standard & Poor 500 stock index futures 
(CME-SP), United States 30-year government bond (CBOT-US) and German 10-year 
government bond (EUREX-BUND). The daily closing prices are used as the data set. 
The original closing price is transformed into a five-day relative difference in 
percentage of price (RDP). The input variables are constructed from four lagged RDP 
values based on 5-day periods and one transformed closing price, which is obtained by 
subtracting a 15-day exponential moving average from the closing price. The output 
variable RDP+5 is obtained by first smoothing the closing price with a 3-day 
exponential moving average. Then, all the data points are scaled into the range of 
[-0.9, 0.9], as the data points include both positive and negative values. In all 
three data sets there are a total of 907 data patterns in the training set and 200 
data patterns in each of the validation set and the test set. 

The Gaussian function is used as the kernel function of the SVMs. Both the kernel 
parameter and $C$ are chosen as 10, as these values produced the smallest normalized 
mean squared error (NMSE) on the validation set. The NMSE is calculated as 

$$NMSE = \frac{1}{\sigma^2 n} \sum_{i=1}^{n} (d_i - y_i)^2, \qquad \sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2 \qquad (10)$$

where $\bar{d}$ denotes the mean of $d_i$. With respect to $\varepsilon_0$, a range of 
reasonable values from 0.001 to 0.1 is studied. The Sequential Minimal Optimization 
algorithm [7,8] is implemented and the program is developed in the VC++ language. 

Table 1 gives the results of ε-DSVMs and the original SVMs for the three 
futures. It can be seen that the converged NMSE of the ε-DSVMs is smaller 
than that of the original SVMs for all the investigated values of $\varepsilon_0$. The 
result also holds for all three futures. Fig. 2 gives the predicted and actual values of 
RDP+5 on the test data points. It is apparent that the ε-DSVMs forecast more closely to 
the actual values and capture the turning points better than the original SVMs. 

5 Conclusions 

This paper proposes a modified version of SVMs to model financial time series by 
incorporating the non-stationary characteristic of financial time series into SVMs. 
These modified SVMs use a non-constant tube whose size decreases from the 
distant training data to the recent training data in an exponential form. 
The simulation results demonstrate that the ε-DSVMs are more effective in modeling 
financial time series because they consider the structural changes of the series. 

Future work will involve a theoretical analysis of the ε-DSVMs. More 
sophisticated weight functions which can closely follow the dynamics of financial 
time series will be explored to further improve the performance of support vector 
machines in financial time series forecasting. 
Table 1. The NMSE of ε-DSVMs and the original SVMs. 

          CME-SP              CBOT-US             EUREX-BUND 
ε_0       ε-DSVMs   SVMs      ε-DSVMs   SVMs      ε-DSVMs   SVMs 
0.001     0.8910    0.9512    1.0902    1.1745    1.1794    1.2603 
0.005     0.8841    0.9545    1.0907    1.1750    1.1976    1.2697 
0.01      0.9023    0.9447    1.0697    1.1784    1.1870    1.2599 
0.05      0.8961    0.9508    1.0992    1.1658    1.2052    1.2522 
0.1       0.9172    0.9550    1.0911    1.1767    1.2058    1.2534 




[Figure panels: (a) (b) (c)] 

Fig. 2. Predicted and actual values of RDP+5. (a) CME-SP. (b) CBOT-US. (c) EUREX-BUND. 
Abstract. In this paper, a new indicator, WARS (Weighted Accumulated 
Reconstruction Series), for classifying the state of a financial market as either 
a trending state or a mean-reverting state is presented. Originating from the 
computation of Entropy, this new indicator was found to reflect the 
market behavior accurately and easily. The algorithm for generating WARS and 
its meaning in relation to Entropy are introduced, and some comparison results 
between WARS and the Daily Profit Curve are listed. As a new indicator, 
WARS can also be used to build a trading system that provides buy, sell and hold 
signals. Through its application to the S&P 500 index, it was verified to be 
an effective and promising indicator. 



1 Introduction 

One of the basic tenets put forth by Charles Dow in the Dow Theory [1] is that 
security prices do trend. Trends are often measured and identified by "trendlines" and 
they represent the consistent change in prices (i.e., a change in investor expectations). 
A rising trend and a falling trend are illustrated in Fig. 1 and Fig. 2. 




A principle of technical analysis is that once a trend has been formed, it will 
remain intact until broken [2]. The goal of technical analysis is to analyze the current 
trend using trendlines and then either invest with the current trend until the trendline 
is broken, or wait for the trendline to be broken and then invest with the new 
(opposite) trend. For trading, it is very important to know the current market state, 
i.e. whether it is in a rising trend or a falling trend. Our work therefore focused on 
searching for indicators that can reflect the fluctuation of a price or index in the 
financial markets. An indicator, called Weighted Accumulated Reconstruction Series 
(WARS), has been constructed and found to have interesting characteristics. It can 
reflect the trend of the changes in price. The indicator makes use of more 
information contained in the data than a moving average and therefore may be able to 
better reflect the state of the price or index. 



2 Weighted Accumulated Reconstruction Series (WARS) 



The idea of generating Weighted Accumulated Reconstruction Series (WARS) came 
from the computation of Entropy. The concept of Entropy was first proposed by 
Shannon [3] in the Information Theory as a measure of the complexity of a system. 
Up to now, this concept has been applied in the economic domain to measure the 
production flexibility [4], customer requirements [5], and processing cost of 
administrating the production facility [6]. In the capital market, a derivative of 
information entropy - Kolmogorov Entropy, was applied to measure how chaotic a 
system is based on the analysis of real-time price or index [7, 8, 9, 10, 11, 12]. By 
calculation of Kolmogorov Entropy, the predictability of the price changes or returns 
is studied. Kapur and Kesavan [13] even used the Kullback's Minimum Cross-Entropy 
Principle to minimise the risk in portfolio analysis. 

The Shannon Entropy, represented by Ent(S), of a system is defined as: 

$$Ent(S) = -\sum_{i=1}^{k} P(C_i, S) \log(P(C_i, S)) \qquad (1)$$

where $C_i$ represents the $i$th event in system $S$, $i = 1, 2, \dots, k$; 

$P(C_i, S)$ is the a priori probability of the occurrence of event $C_i$ in $S$. 
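
As a small worked example of Eq. 1, the sketch below evaluates the Shannon entropy of a discrete distribution. The natural logarithm is an assumption (the text does not fix the base), and zero-probability events are skipped using the usual convention 0 log 0 = 0.

```python
import numpy as np

def shannon_entropy(probs):
    """Shannon entropy -sum p log p of a discrete probability distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # convention: events with zero probability contribute nothing
    return float(-np.sum(p * np.log(p)))
```

A system with a single certain event has entropy 0, while two equally likely events give log 2.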

From the above definition in Eq. 1, the distribution of the system must be known 
before the System Entropy is calculated. In practice, however, the distribution of the 
system is usually not known in advance. The easiest way to solve this problem is to 
accumulate the series, which will follow the exponential function for a positive series. 
Following this idea, the algorithm to construct the new indicator is formulated in the 
following steps. 

Step 1: Normalize every value of the series to lie between -1 and 1 (to remove the 
amplitude effect from the series). 

$$x_i = x_i / \max(|x_i|); \quad (i = 1, 2, \dots, \mathrm{Win\_length}) \qquad (2)$$

Step 2: Subtract the first value of the series (to keep all the intervals at the same 
beginning, the origin of coordinates). 

$$x_i = x_i - x_1; \quad (i = 1, 2, \dots, \mathrm{Win\_length}) \qquad (3)$$

Step 3: Subtract the mean value from the whole series. 

$$x_i = x_i - \mathrm{Mean\_x}; \quad (i = 1, 2, \dots, \mathrm{Win\_length}) \qquad (4)$$

where: $\mathrm{Mean\_x} = \frac{1}{\mathrm{Win\_length}} \sum_{i=1}^{\mathrm{Win\_length}} x_i$ 




Step 4: Reconstruct a new series (WARS) by means of weighted accumulation of the 
original one. 

$$\mathrm{Weight}_j = \frac{1 + 2 + \cdots + j}{1 + 2 + \cdots + \mathrm{Win\_length}}; \quad (j = 1, 2, \dots, \mathrm{Win\_length}) \qquad (5)$$

$$y_1 = \frac{1}{1 + 2 + \cdots + n}\, x_1$$

$$\vdots$$

$$y_n = \frac{1}{1 + 2 + \cdots + n}\, x_1 + \frac{1 + 2}{1 + 2 + \cdots + n}\, x_2 + \cdots + \frac{1 + 2 + \cdots + n}{1 + 2 + \cdots + n}\, x_n \qquad (6)$$

where $n$ = Win_length. In this process, the more recent points contribute more to WARS. 

Step 5: Calculate the area of this interval and take its absolute value. 

$$\mathrm{Area} = |y_1 + y_2 + \cdots + y_n| \qquad (7)$$



The trending and mean-reverting states are distinguished according to the area 
value. If the area value is greater than 0, then the market is in a trending state; 
otherwise the market is in a mean-reverting state. 

In Figure 3, three curves representing up-trending, down-trending and 
mean-reverting series are drawn. After weighted accumulated reconstruction, the 
corresponding three curves are drawn in Figure 4. 



[Figure panel: "Simulated Series"] 

Fig. 3. The illustration of Accumulated Reconstruction Series 

[Figure panel: "Accumulated Reconstruction Series (ARS)"] 

Fig. 4. The illustration of Accumulated Reconstruction Series 


During the process of generating WARS, a window is rolled along the original series and 
the area of every interval is calculated iteratively. For instance, suppose there are 10 
points in a series and Win_length is chosen as 4. From the 1st point to the 4th point the 
first area value is calculated. From the 2nd point to the 5th point the second area value 
is obtained. This process is repeated and eventually six points are obtained to construct 
WARS. The length of WARS equals the length of the original series less Win_length. 
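
The five steps and the rolling window can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the weight of the j-th point is (1 + 2 + ... + j) / (1 + 2 + ... + Win_length) and that y_i is the running sum of the weighted points. It also evaluates every full window, which yields len(series) - win_length + 1 values, one more than the count stated in the text, so the original procedure may drop one window.

```python
import numpy as np

def wars_area(window):
    """Steps 1 to 5 for a single window of prices."""
    x = np.asarray(window, dtype=float)
    n = len(x)
    peak = np.max(np.abs(x))
    if peak > 0:
        x = x / peak                        # Step 1: scale into [-1, 1]
    x = x - x[0]                            # Step 2: start at the origin
    x = x - x.mean()                        # Step 3: remove the mean
    j = np.arange(1, n + 1)
    weights = np.cumsum(j) / np.sum(j)      # Step 4: assumed weight definition
    y = np.cumsum(weights * x)              # weighted accumulation
    return float(abs(np.sum(y)))            # Step 5: absolute area

def wars(series, win_length):
    """Roll the window over the series and collect one area per window."""
    series = np.asarray(series, dtype=float)
    starts = range(len(series) - win_length + 1)
    return np.array([wars_area(series[s:s + win_length]) for s in starts])
```

Under these assumptions a steadily rising window and its mirror-image falling window give the same area, and a constant window gives 0.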



3 Comparison of WARS and Daily Profit Curves 

For a large company, usually a certain strategy will be adopted to direct its operation 
in the financial market. Further, the equity curve of a period will be used to evaluate 
the pros and cons of this strategy [14]. If the equity curve goes up, the company is 
making a profit, and vice versa. The generation of the equity curve is not 
described in this paper; it is taken as given information. For the testing of 
the new indicator, 15 historical futures data sets supplied by Man-Drapeau Research Pte 
Ltd (Singapore) were selected. The WARS was generated from the daily closing price. 
To make the two curves convenient to compare, the equity curve was first 
changed to a daily profit curve by using the following method: 

$$y_i = x_i - x_{i-1}; \quad (i = 1, 2, \dots, n) \qquad (8)$$
Then the moving average curve [15] was calculated for both WARS and Daily 
Profit Curve. The correlation coefficient of WARS and the Daily Profit Curve was 
calculated to evaluate their similarity. 
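
The comparison can be sketched as below. The text does not state the type or length of the moving average or how the two curves are aligned, so this example assumes a simple moving average and aligns the two smoothed curves at their most recent points; all function names are hypothetical.

```python
import numpy as np

def daily_profit(equity):
    """Daily profit curve: first difference of the equity curve."""
    return np.diff(np.asarray(equity, dtype=float))

def moving_average(x, window):
    """Simple moving average over full windows only (an assumed choice)."""
    x = np.asarray(x, dtype=float)
    return np.convolve(x, np.ones(window) / window, mode="valid")

def curve_similarity(curve_a, curve_b, window):
    """Correlation coefficient of the two smoothed curves."""
    ma_a = moving_average(curve_a, window)
    ma_b = moving_average(curve_b, window)
    m = min(len(ma_a), len(ma_b))  # align on the most recent m points
    return float(np.corrcoef(ma_a[-m:], ma_b[-m:])[0, 1])
```

A correlation close to 1 indicates that the two smoothed curves move together.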



[Figure panels: "Moving Average of WARS and Daily Profit Curve" (two panels)] 

Fig. 5. The comparison of WARS and Daily Profit Curve for SP Futures 

Fig. 6. The comparison of WARS and Daily Profit Curve for TA Futures 



In Figures 5-6, the two curves were rescaled between -1 and 1 so that they can 
be compared directly. The solid line in Figures 5-6 represents WARS while 
the other curve represents the Daily Profit Curve. In these figures, the two curves 
have a similar shape, and WARS reflects the fluctuation of the Daily Profit Curve. 
The correlation coefficient was always larger than 0.6 (and sometimes almost equal 
to 0.95). Thus WARS reflects the changes of the Daily Profit Curve quite well. 

From the way WARS is generated, its meaning in the financial market can be
interpreted as follows. WARS is generated from the closing prices in a period. If
within this period the price changes in a trending way, either up-trending or
down-trending, WARS will maintain its large value or may go up. When the price
fluctuates in a mean-reverting way, WARS will go down or remain at a small value.
Indirectly, it continuously reflects the changes in price. From the viewpoint of
entropy, it can be interpreted simply as follows. The market states can be
represented as 1 (up-trending), 0 (mean-reverting) and -1 (down-trending). When the
market is in a trending state (either 1 or -1), the entropy of the market equals 0,
corresponding to a large value of WARS, close to 1. The mean-reverting state (0) is
composed of up-trending and down-trending states, so the entropy of the market equals
$\log 2$ ($-(\frac{1}{2}\log\frac{1}{2} + \frac{1}{2}\log\frac{1}{2}) = \log 2$),
corresponding to a small value of WARS, close to 0. These results are easy to
understand: when the market moves in a trending way, the system is more certain than
in a mean-reverting state, in which the direction of price cannot be determined, i.e.
the system contains more uncertainty.
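
The two entropy values quoted above can be checked numerically. The sketch below assumes the Shannon entropy with natural logarithm and models the mean-reverting state as an equal mixture of up-trending and down-trending; the function name is hypothetical.

```python
import math

def entropy(probabilities):
    """Shannon entropy -sum(p * log p), skipping zero-probability states."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

trending = entropy([1.0])             # market certainly in one trending state
mean_reverting = entropy([0.5, 0.5])  # up and down equally likely
```

Here `trending` evaluates to zero and `mean_reverting` to `math.log(2)`, matching the two cases in the text.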



4 Using WARS to Generate Trading System 

Based on the previous analysis, WARS was used in this section to generate a trading
system. The trading system [16] was built in the following steps:

• Calculate WARS using historical data.

• Determine the threshold values for buying and selling actions according to the
values of WARS calculated using the training data set.

• Generate the trading signals as follows:

If value of WARS > threshold of buying, then buy at the next day's opening price
If value of WARS < threshold of selling, then sell at the next day's opening price

If value of WARS is in between the thresholds, then no action is taken - hold.
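
The signal rule in the last step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the string labels are hypothetical, and the WARS values are assumed to be already computed.

```python
def trading_signals(wars, threshold_buy, threshold_sell):
    """Map each WARS value to 'buy', 'sell' or 'hold'.

    A 'buy' or 'sell' on day t is meant to be executed at the opening
    price of day t + 1, as stated in the rule above.
    """
    signals = []
    for value in wars:
        if value > threshold_buy:
            signals.append("buy")
        elif value < threshold_sell:
            signals.append("sell")
        else:
            signals.append("hold")
    return signals
```

With the thresholds reported in Table 1, `trading_signals([0.01, 0.0, -0.02], 0.003378, -0.015396)` returns `["buy", "hold", "sell"]`.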

Figure 7 depicts such a trading system using the S&P 500 index. The time interval of
the data was daily. The WARS was calculated based on a 5-day Win_length.



Fig. 7. Trading system generated using WARS on S&P 500 index (panel title: CME-SP 500)

From Figure 7, it can be seen that WARS gave correct trading signals at suitable
times when the market changed gradually, from Jan. 1993 to Oct. 1997. But when the
market changed dramatically, from Nov. 1997 to Apr. 1999, some signals lagged behind
the change of price. This indicates that WARS is a reactive indicator.
The trading performance is shown in Table 1.



Table 1. Trading system performance based on the indicator WARS on S&P 500

Training Data Set                      Testing Data Set
Training period:                       Testing period:
01/04/1988 - 12/31/1992                01/01/1993 - 08/12/1999
max_WARS_area  = 0.022152              Net_profit     = 873.649963
min_WARS_area  = -0.034170             max_win        = 325.150024
threshold_buy  = 0.003378              max_loss       = -37.000000
threshold_sell = -0.015396             Trading_number = 14
Mean_WARS_area = 0.000446              Winning_Trade  = 9
Std_WARS_area  = 0.005593              Sharpe_ratio   = 0.522271



There were altogether 14 trades in this index, of which 9 were profitable. From these
results, it can be seen that this new indicator is effective in differentiating the
market states, and so it can be used to trace the changing market and provide
trading signals.

5 Concluding Remarks

A new indicator, the Weighted Accumulated Reconstruction Series (WARS), is presented.
In comparison with the Daily Profit Curve, WARS can indicate the Daily Profit Curve
accurately and easily. In addition, WARS can be used to build a trading system.
Through the application to the S&P 500 index, it can be seen that this indicator is
effective and promising. Further, WARS can be used to reflect the uncertainty of the
market. When the magnitude of WARS approaches 1, the market is in a strongly trending
state and therefore more investment can be made.

Although WARS is very similar in behaviour to the Daily Profit Curve, there are
still many factors affecting the final results, such as the parameters Win_length and
Moving Average Interval. These issues will be studied further in the future.



Abstract. We propose a neural network based “left shoulder” detector. The
auto-associative neural network was trained with the “left shoulder” patterns
obtained from the Korea Composite Stock Price Index, and then tested out-of-sample
with a reasonably good result. A hypothetical investment strategy based
on the detector achieved a return of 124% in comparison with a 39% return from
a buy and hold strategy.



1. Introduction 

Technical analysts use certain stock chart patterns and shapes as signals for profitable
trading opportunities [6]. Many professional traders claim that they consistently make
trading profits by following those signals. Recently there have been efforts to identify
“change points” with data mining techniques [3].




Figure 1. Head and shoulder formation 

One of the best known chart patterns is the “Head and Shoulder Formation” (HSF) [7]
(see Figure 1). The HSF is believed to be one of the most reliable trend reversal
patterns. The “Left Shoulder” (LS) is the starting event of an HSF. In particular,
if we can detect an LS of an HSF, it can be used for profitable trading, such as buying
stocks in an individual stock market. With the detection of the “Head” of an HSF, a
short sale makes a profitable trade.

Generally technical analysts try to detect HSF manually after the fact. The process is
obviously subjective and the prediction is often incorrect. However, it is possible to
accurately detect HSF from historical data.

Thus, a pattern classification method such as a neural network is an ideal candidate.
The detection problem can now be formulated as a 2-class problem. A neural network
is trained with LS patterns and non-LS patterns. Then, given new input data, or a
current situation, the network tries to classify it as an LS of HSF or a non-LS of HSF.
A problem with this approach is the inability to collect a sufficient number of
“non-LS” patterns. This is the well known problem of a “partially-exposed environment”
in pattern classification, where training data from one class are very few or
non-existent. Related problems include counterfeit bank note detection and typing
pattern identity verification [1].

Recently, the Auto-Associative Neural Network (AANN) has been proposed as quite
effective in partially-exposed environments [1]. An AANN is basically a neural network
whose input and target vectors are the same. In Section 2 the details of AANN are
reviewed and the rationale for using it for LS detection is provided.

The proposed detection process is as follows. First, the LS patterns are identified in
a historical database. Second, they are used to train the AANN. Third, the trained AANN
is used as an LS detector. An input pattern is compared with the output. If they are
similar enough, the input pattern is classified as LS. Otherwise it is classified as
non-LS. The LS signal could result in a “buy” recommendation while a non-LS signal
results in a “sell” or other action recommendation (see Figure 2).



Figure 2. System framework



2. Auto-associative neural network as an LS detector 

AANN should reproduce an input vector at the output with the least error [2]. Let $F$
denote an auto-associative mapping function, $x_i$ an input vector and $y_i$ an output
vector. Then network $F$ is usually trained to minimize the mean square error given by
the equation:

$E = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - y_i \rVert^2 = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - F(x_i) \rVert^2$

The mapping function $F$ can be separated into $F_1$ and $F_2$, that is
$F(\cdot) = F_2(F_1(\cdot))$, where $F_1$ is a dimension reduction process and $F_2$ a
dimension expansion process. Dimension reduction is achieved by projecting the vectors
in the input space onto a subspace captured by the set of weights in the network part
for $F_1$. Dimension expansion is achieved by mapping the lower dimensional vectors
onto a hypersurface captured by the set of weights in the network part for $F_2$.
Generally the subspace and hypersurface are nonlinear because of the nonlinearity in
the transfer function.

Historical financial data have particular trends and characteristics. They tend to repeat
themselves. The financial situations that correspond to LS of HSF are assumed to have
unique characteristics. If the core information can be incorporated into the network
input variables, the unique characteristics can be captured by the subspace of the AANN
embodied by the transformation at the hidden layers. Once the AANN is trained with LS
data sets, any LS data that share the common characteristics will result in a small
error at the output layer, while non-LS data will result in a large error at the output
layer. With an appropriate threshold, the AANN can be used to detect the occurrence of
the LS.
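
The detection rule described above (small reconstruction error means LS) can be illustrated with a toy example. Note the substitution: in place of the paper's nonlinear multi-layer AANN, this sketch uses a linear projection onto a single direction, which is the simplest possible autoassociator. All function names are hypothetical, and the training patterns are assumed to have a non-zero mean.

```python
import math

def fit_direction(patterns):
    """Toy 'training': unit vector along the mean of the training patterns."""
    dim = len(patterns[0])
    mean = [sum(p[d] for p in patterns) / len(patterns) for d in range(dim)]
    norm = math.sqrt(sum(m * m for m in mean))
    return [m / norm for m in mean]

def reconstruct(x, direction):
    """F(x): project x onto the learned one-dimensional subspace."""
    coeff = sum(a * b for a, b in zip(x, direction))
    return [coeff * d for d in direction]

def reconstruction_error(x, direction):
    """Squared distance between the input and its reconstruction."""
    y = reconstruct(x, direction)
    return sum((a - b) ** 2 for a, b in zip(x, y))

def is_ls(x, direction, threshold):
    """Classify as LS when the reconstruction error is below the threshold."""
    return reconstruction_error(x, direction) < threshold
```

A pattern lying in the learned subspace, such as `[4.0, 4.0]` after training on `[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]`, is reconstructed almost exactly and accepted, while `[1.0, -1.0]` has error 2 and is rejected for any threshold below that.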



3. Data collection and neural network training 



We used Korea Composite Stock Price Index (KOSPI) data from April 1, 1997 to July
24, 2000 for the experiment. The KOSPI is a market-value weighted index,
similar to the S&P 500 and TOPIX [8]. The base date is January 4, 1980 with a base
index of 100. For each trading day $i$, let $O(i)$ denote the opening index, $H(i)$ the
high, $L(i)$ the low, $C(i)$ the closing, $\Delta(i)$ the net change, $V(i)$ the volume
and $M(i)$ the turnover. Various moving averages of 20 days and 5 days were calculated
as follows:






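$C_{20}(i) = \frac{1}{20} \sum_{j=i-19}^{i} C(j), \qquad C_{5}(i) = \frac{1}{5} \sum_{j=i-4}^{i} C(j),$

with the 20-day averages of the other quantities, such as
$M_{20}(i) = \frac{1}{20} \sum_{j=i-19}^{i} M(j)$, defined in the same way.

We used a total of 10 input variables: $[O(i) - C_{20}(i)]$, ...,
$[C(i) - C_{20}(i)]$, ..., $[M(i) - M_{20}(i)]$, $[C(i) - C_{5}(i)]$, $C(i)$ and
$\log(C(i)/C(i-1))$. Combining daily data with moving averages can reduce the number
of input variables effectively while maintaining historical information. Reducing the
number of input variables helps to prevent overfitting [4].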
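
The closing-price inputs described above can be computed as in the following sketch. Only the four inputs that depend on the closing index alone are shown; the function names and dictionary keys are hypothetical, and `i` is a 0-based index that must be at least 19 so that a full 20-day window exists.

```python
import math

def trailing_mean(series, i, window):
    """Mean of series[i-window+1 .. i]; window=20 gives C_20(i)."""
    return sum(series[i - window + 1:i + 1]) / window

def close_features(close, i):
    """Inputs built from the closing index C on day i."""
    c20 = trailing_mean(close, i, 20)
    c5 = trailing_mean(close, i, 5)
    return {
        "C - C20": close[i] - c20,
        "C - C5": close[i] - c5,
        "C": close[i],
        "log return": math.log(close[i] / close[i - 1]),
    }
```

For a series rising by one point per day from 100.0, day 24 closes at 124.0, its 20-day average is 114.5 and its 5-day average is 122.0, so the first two inputs are 9.5 and 2.0.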

Figure 3. KOSPI data set from 1997 to 2000

Figure 3 displays the KOSPI data used in this experiment. Trading days before March
1999 were used for training while the days after were used for testing. However, since
only the days constituting LS were used for actual training and validation, only
86 days and 27 days were actually used to train and validate the AANN, respectively.

LS days were manually selected by one of the authors. The criteria used are
as follows:

-Choose a selling climax, defined as a local maximum over a 30-day period (see
Figure 1).

-Draw the neckline with head and shoulder.

-Collect data corresponding to LS (for more details of the terms, refer to [5], [6]).

The AANN used has a 10L-12N-7N-12N-10L structure, where L denotes a linear
transfer function and N denotes a nonlinear transfer function (tangent sigmoid).
A 5-layer network with nonlinear transfer functions can perform better dimension
reduction than a 3-layer network [2]. Gradient descent was employed to minimize the
error function, with early stopping to prevent overfitting. The experiment
was performed in MATLAB 5.3.



4. Results 



Figure 4 shows the KOSPI during the test period as well as the network’s prediction of 
LS indicated by thick bars. A threshold of 0.3 was empirically determined based on 
the performance with the training set. 

We employed two classification measures, the False Rejection Rate (FRR) and the False
Acceptance Rate (FAR), and a financial measure, the return rate. Let us define $D(i)$
and $L(i)$ as follows:

$D(i) = \begin{cases} 0 & \text{if classified as non-LS at the $i$th day} \\ 1 & \text{if classified as LS at the $i$th day} \end{cases}$

$L(i) = \begin{cases} 0 & \text{if non-LS at the $i$th day} \\ 1 & \text{if LS at the $i$th day} \end{cases}$

Figure 4. The results of test set

The classification measures FRR and FAR are defined as

$FRR = \left( \sum_{i=1}^{n} L(i) - \sum_{i \in \{i \mid L(i)=1\}} D(i) \right) \Big/ \sum_{i=1}^{n} L(i)$ and $FAR = \sum_{i \in \{i \mid L(i)=0\}} D(i) \Big/ \left( n - \sum_{i=1}^{n} L(i) \right)$

where $n$ is the total number of days in the test set.
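
As an illustration of the two definitions, the following sketch computes FRR and FAR from two 0/1 lists. The function name is hypothetical, and the inputs are assumed to contain at least one LS day and one non-LS day so that neither denominator is zero.

```python
def frr_far(detected, actual):
    """Return (FRR, FAR) for 0/1 lists D(i) (detected) and L(i) (actual)."""
    n = len(actual)
    ls_days = sum(actual)
    # LS days that were correctly detected
    hits = sum(d for d, l in zip(detected, actual) if l == 1)
    # non-LS days that were wrongly flagged as LS
    false_alarms = sum(d for d, l in zip(detected, actual) if l == 0)
    frr = (ls_days - hits) / ls_days
    far = false_alarms / (n - ls_days)
    return frr, far
```

For example, with `actual = [1, 1, 1, 0, 0, 0, 0]` and `detected = [1, 1, 0, 1, 0, 0, 0]`, one of three LS days is missed and one of four non-LS days is falsely accepted, giving FRR = 1/3 and FAR = 1/4.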

The proposed approach is evaluated based on a financial measure, the return rate. Let us
envision a hypothetical investment strategy based on the LS signal, i.e. $D(i)$. The
so-called LS strategy dictates “buy” when $D(i)$ changes from 0 to 1 and “sell” when
$D(i)$ changes from 1 to 0. For comparison, a buy and hold strategy was also evaluated:

Return of LS strategy $= \sum_{i \in \{i \mid D(i)=1\}} \left( O(i+2) - O(i+1) \right) \Big/ O(1)$

Return of Buy and Hold $= (C(n) - O(1)) / O(1)$.

We assumed that one buys or sells at the next day’s opening price and that the market
is perfectly liquid with no transaction cost.
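
The two return formulas can be sketched as follows. The function names are hypothetical, the lists are 0-based so that `opening[0]` plays the role of O(1), and signal days too close to the end of the data for O(i+2) to exist are skipped, which is an assumption the paper does not spell out.

```python
def ls_return(opening, detected):
    """Sum of O(i+2) - O(i+1) over days with D(i) = 1, divided by O(1)."""
    total = sum(opening[i + 2] - opening[i + 1]
                for i, d in enumerate(detected)
                if d == 1 and i + 2 < len(opening))
    return total / opening[0]

def buy_and_hold_return(opening, closing):
    """(C(n) - O(1)) / O(1)."""
    return (closing[-1] - opening[0]) / opening[0]
```

With `opening = [100, 102, 105, 103, 108]` and `detected = [1, 1, 0, 0, 0]`, the LS strategy collects (105 - 102) + (103 - 105) = 1 point, a return of 0.01.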

The performance of the AANN on the test set (March 2, 1999-July 24, 2000) is given in
Table 1. There is a trade-off between FRR and FAR. If the threshold of the event score
increases, FAR becomes smaller and FRR becomes larger. The return of the LS strategy is
124%, about three times the return of the buy and hold strategy.



Table 1. Performance of AANN in test set

Measurement                 Value (total 348 days)
False Rejection Rate        22.2 % (20 days / 90 days)
False Acceptance Rate       9.6 % (25 days / 258 days)
Return of LS strategy       124 % (648.2 points / 524.9 points)
Return of Buy and Hold      39 % (203.9 points / 524.9 points)



5. Conclusions

In this paper, we proposed a neural network based detector of the “left shoulder” in
“head and shoulder formations”. The auto-associative neural network was trained with
the “left shoulder” patterns obtained from the Korea Composite Stock Price Index for 23
months (April 1997-February 1999), and then tested on the out-of-sample period of
March 1999-July 2000. The preliminary result was surprisingly good given the fact
that the training period coincided with the worst financial crisis in the nation’s
history. A hypothetical investment strategy based on the detector achieved a return of
124% in comparison with a 39% return from a buy and hold strategy.

There are several limitations in this work. First, the performance criteria used leave
a lot to be desired. The False Acceptance Rate and False Rejection Rate are problematic
since they simply count the number of days, and thus unfairly give more weight to the
detection of a slowly arising left shoulder. Second, left shoulder detection leads to a
market entry signal. Even more important is to find a way to give a market exit signal.
Detection of the head or right shoulder may help. Third, KOSPI itself is not tradable,
but KOSPI 200, a subset of KOSPI, is. Futures and options use it as an underlying asset.

It will be worthwhile to investigate whether a network trained with KOSPI data can
detect LS in other data sets such as KOSPI 200 or individual equity stocks. It
will also be useful to detect such widely used chart patterns as symmetrical triangles,
descending triangles, ascending triangles, double bottoms, double tops and rising
wedges [7].



Acknowledgements 

This research was supported by Brain Science and Engineering Research Program 
sponsored by Korean Ministry of Science and Technology and by the Brain Korea 21 
Project to the first author. 



References 

1. S. Cho, C. Han, D. Han, & H. Kim. (2000). Web based Keystroke Dynamics Identity
Verification using Neural Network. Journal of Organizational Computing and Electronic
Commerce. In print.

2. C. Bishop. (1995). Neural Networks for Pattern Recognition. Oxford: Clarendon Press.

3. V. Guralnik, J. Srivastava. (1999). Event Detection from Time Series Data. KDD-99
Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pp 33-42.

4. G. Deboeck. (1994). Trading on The Edge. John Wiley & Sons, Inc.

5. W. Eng. (1988). The Technical Analysis of Stocks, Options & Futures. McGraw-Hill.

6. TradeTalk company. (2000). http://www.tradertalk.com/tutorial/h&s.html.

7. Borsanaliz.com company. (2000). “Tools for technical analysis stock exchange”,
http://www.geocities.com/wallstreet/floor/1035/formations.htm

8. Korea Stock Exchange. (2000). “KOSPI & KOSPI 200”, http://www.kse.or.kr.




A Computational Framework 
for Convergent Agents 



Wei Li 

Department of Computer Science 
Beijing University of Aeronautics and Astronautics 
Beijing 100083 China 



Abstract 

As a computational approach, a framework is proposed for computing 
the limits of formal theory sequences. It defines a class of agents, called 
convergent agents. The approach provides a method to generate a new 
theory by the limit of some sequence of theories, and also has potential ap- 
plications to many scientific and engineering problems. As applications of 
the framework, some convergent agents are discussed briefly, e.g., GUINA, 
which can learn new versions from the current versions of a theory and 
some external samples, and the learned versions converge to the truth one 
wants to know. 

Keywords: Formal theory sequence, Limit, Convergent agent, Inductive
inference, Algebraically closed fields.



1 Introduction 

There is a class of agents which have the following computational characters:
each agent constantly accesses some countably infinite external data $S_1, S_2,
\ldots, S_k, \ldots$ as its inputs, and for the $k$th computation round it generates
$\Gamma_k$ as its output. The outputs form an infinite sequence $\Gamma_1, \Gamma_2,
\ldots, \Gamma_k, \ldots$, and the sequence is convergent to a certain limit. The
following example demonstrates the feature:

Given a data set stored on a network, it can be expressed as a language
$E = \{w_1, w_2, \ldots\}$. Given its non-terminals $V$ (in fact this restriction can
be removed) and terminals $T$, the question is: how to construct a grammar
$G = (V, T, P, S)$ such that $E = L(G)$? The problem may be solved in an
evolutionary way as follows:

1. At the beginning we choose an initial grammar $G_0 = (V, T, P_0, S)$, where
$P_0$ is a set of production rules and can be viewed as our first guess of the
rules.

2. We then check if $w_1 \in L(G_0)$; if yes, then check if $w_2 \in L(G_0)$, ...;
otherwise we have $w_1 \notin L(G_0)$. Let $\alpha$ be the sentential form obtained by
(partial) bottom-up parsing analysis of $w_1$ in $G_0$; then there are two
possibilities:

K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 295-300, 2000.

(a) If $L((V, T, P_0 \cup \{S \to \alpha\}, S)) \neq \emptyset$ then let
$P_1 = P_0 \cup \{S \to \alpha, S \to w_1\}$. Thus we get a new grammar
$G_1 = (V, T, P_1, S)$ by adding new productions.

(b) If $L((V, T, P_0 \cup \{S \to \alpha\}, S)) = \emptyset$ then we have to contract
$P_0$ by deleting some rules from $P_0$ which contradict $S \to \alpha$, and get $P_1$.

3. Generally, for $G_{k-1}$ we check if $w_k \in L(G_{k-1})$; if yes, then check if
$w_{k+1} \in L(G_{k-1})$, ...; otherwise either add new productions, i.e.
$P_k := P_{k-1} \cup \{S \to \alpha, S \to w_k\}$, to get a new grammar $G_k$, or
revise it as in sub-case 2(b). Here $\alpha$ is the sentential form obtained by
(partial) bottom-up parsing analysis of $w_k$ in $G_{k-1}$.

The above process can be easily specified by an agent which we call
Analysis$(G_0, E)$. Let $E_k = \{w_1, \ldots, w_k\}$. It will generate a sequence of
grammars: $G_0$, $G_1 = Analysis(G_0, E_1)$, ..., $G_k = Analysis(G_{k-1}, E_k)$, ...,
where $\bigcup_{i=1}^{\infty} E_i = E$. If $E$ is finite then Analysis will stop in
finite steps. If $E$ is infinite then Analysis may execute forever, but in many cases
the sequence $\{G_n\}$ will have a "limit" $G$. The computational properties of
Analysis are listed informally as follows:

(1) Analysis is a Turing machine. It takes the current grammar $G_k$ and
some parts of the data $E_{k+1}$ to be analyzed as its inputs and outputs a new
grammar $G_{k+1}$. (2) The input $E$ to Analysis is countably infinite. (3) At the
beginning a grammar $G_0$ can be taken as an initial grammar to feed Analysis.
(4) In each computational round Analysis takes the current grammar $G_{k-1}$ and
external data $E_k$ as its inputs and generates a new grammar $G_k$. In other words,
external information $E$ is involved in the computation process. (5) The outputs
of every round of Analysis form a sequence $\{G_k\}$. (6) The computational process
is rational if the grammar sequence $\{G_k\}$ is convergent to some limit $G$ and
$L(G) = E$.

From now on we call the agents with the computational characters described
above convergent agents. The purpose of the paper is to formalize these characters
and to demonstrate them by examples.

2 Limits of Theory Sequences and Convergent 
Agents 

In this section we will define convergent agents based on the classical computation
model, the Turing machine, and the notion of limit given in (Li 1992).

In the rest of the paper we use the following standard notations: $L$ is a first
order language; $\Gamma$ is a finite theory, $\Gamma = \{A_1, \ldots, A_m\}$, where the
$A_i$'s are formulas of $L$; and $Th(\Gamma) = \{A \mid \Gamma \vdash A\}$ is the theory
closure of $\Gamma$. For two given theories $\Gamma_1$ and $\Gamma_2$ we define
$\Gamma_1 = \Gamma_2$ iff $Th(\Gamma_1) = Th(\Gamma_2)$. A sequence of formal theories
is denoted by $\Gamma_0, \Gamma_1, \ldots, \Gamma_k, \ldots$ or by $\{\Gamma_k\}$.

Definition 2.1. (Li 1992) Let $\Gamma_0, \Gamma_1, \ldots, \Gamma_k, \ldots$ be a
sequence of formal theories.

$\Gamma^* = \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} \Gamma_m, \qquad \Gamma_* = \bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} \Gamma_m$

are called the upper limit and lower limit of the above sequence respectively.
$\Gamma_0, \Gamma_1, \ldots, \Gamma_k, \ldots$ is convergent iff $\Gamma^* = \Gamma_*$.
The limit of a convergent sequence is denoted by $\lim_{k \to \infty} \Gamma_k$.
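
As a small illustration of the upper and lower limits, consider a sequence of finite sets that eventually repeats a fixed cycle forever. For such a sequence the upper limit is the union of the sets in the cycle and the lower limit is their intersection, whatever the finite prefix was. The sketch below treats theories as plain sets of formula names and ignores theory closure; the function name is hypothetical and the cycle is assumed to be non-empty.

```python
def limits_of_eventually_periodic(cycle):
    """Upper and lower limit of a set sequence whose tail repeats `cycle`."""
    upper = set().union(*cycle)                  # elements occurring infinitely often
    lower = set.intersection(*map(set, cycle))   # elements eventually always present
    return upper, lower

def is_convergent(cycle):
    """The sequence converges iff the upper and lower limits coincide."""
    upper, lower = limits_of_eventually_periodic(cycle)
    return upper == lower
```

A tail alternating between `{"A", "B"}` and `{"A", "C"}` has upper limit `{"A", "B", "C"}` and lower limit `{"A"}`, so it does not converge; a tail that stays at `{"A"}` converges to `{"A"}`.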

Now we are ready to give the following definition of a computation model
for computing the limit of theory sequences in first order languages.

Definition 2.2. (Convergent agent) Let $\Gamma$ and $S_k$ be finite sets of sentences
in first order languages and

$\Gamma_0 = \Gamma, \qquad \Gamma_k = \varphi(\Gamma_{k-1}, S_k) \quad \text{for } k \geq 1$

where $\varphi$ is a procedure (which can be expressed by a Turing machine).

Let $S = \bigcup_{k=1}^{\infty} S_k$ and define

$\Phi(\Gamma, S) = \begin{cases} \lim_{k \to \infty} \Gamma_k, & \text{if } \lim_{k \to \infty} \Gamma_k \text{ exists} \\ \text{undefined}, & \text{otherwise.} \end{cases}$

Then $\Phi(\Gamma, S)$ is called a convergent agent and $\Gamma$ is called an initial
theory.
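
The iteration in Definition 2.2 can be sketched as a generator that produces the successive theories one round at a time. This is only an illustration: `run_agent` and `accumulate` are hypothetical names, theories are modelled as frozensets of formula names, and `accumulate` is a deliberately trivial stand-in for a real procedure such as GUINA. Only finitely many rounds can ever be executed, so the limit itself is never returned by the code.

```python
def run_agent(phi, gamma, samples):
    """Yield Gamma_0, Gamma_1, ... with Gamma_k = phi(Gamma_{k-1}, S_k)."""
    yield gamma
    for s_k in samples:
        gamma = phi(gamma, s_k)
        yield gamma

def accumulate(gamma, s_k):
    """Hypothetical procedure: add the new sentences to the current theory."""
    return gamma | s_k
```

Feeding the initial theory `frozenset({"A"})` and the samples `frozenset({"B"})`, `frozenset({"B"})` yields `{"A"}`, `{"A", "B"}`, `{"A", "B"}`: the sequence stabilises as soon as the samples bring nothing new.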

Remark: (1) Note that in general $\Phi(\Gamma, S)$ is not a result produced by the
procedure $\varphi$ because $S$ is infinite and cannot be an input of a Turing machine.
In fact, in each round of computation the output of $\varphi$ is some $\Gamma_k$, which
is an approximation of $\Phi(\Gamma, S)$ and may not equal $\Phi(\Gamma, S)$. Strictly
speaking, the real numbers defined by power series, such as $e$ and $\pi$, are defined
formally by convergent agents. Informally the limit can be viewed as a result of a
generalized Turing machine which allows infinite inputs and produces infinite
convergent sequences. (2) $S_k$ is a formal description of the principle of
observability of the world, which says that data of a scientific problem can be
obtained whenever necessary. In practice $S_k$ can be example sets (training sets) from
huge databases on the networks. (3) If for some $k$, $\varphi(\Gamma_k, S_{k+1})$ does
not halt then $\Gamma_{k+1} = \bot$, where $\bot$ denotes undefined computation. Thus
for this case $\lim_{k \to \infty} \Gamma_k$ does not exist.

Note that if $S = \emptyset$ then the convergent agent defines the classical
computation by a Turing machine. In other words it defines computations in a closed
world. The various resolution procedures are typical examples for this case
and they are used as a framework for theorem proving. If $S \neq \emptyset$ then the
convergent agent defines a class of computations, or say it defines computations in
an open world. The proofs of Lindenbaum's theorem, the extensions used in default
logic, and the problems of knowledge base maintenance, specification capturing,
the rationality of inductive reasoning and a class of agents used in the Internet
can be defined by convergent agents (Li 1999) and (Li and Ma 2000).

3 Convergent Agent GUINA for Inductive
Inference and its Rationality

We show in this section that inductive processes can be modeled in the
proposed framework. In (Li 1999) a convergent agent called GUINA was introduced
to generate convergent inductive sequences. In other words GUINA can
be viewed as an effort to model automated learning systems.

First we need some basic definitions for presenting GUINA. A model $\mathbf{M}$ is
a pair $\langle M, I \rangle$ where $M$ is a domain and $I$ is an interpretation.
Sometimes $M_p$ is used to denote a model for a scientific problem $p$. $Th(M_p)$
denotes the set of all true sentences of $M_p$ and is a countable set. The
interpretation of the Herbrand universe of $L$ under $M_p$ is called the Herbrand
universe of $M_p$ and is denoted by $H_{M_p}$. The interpretation of a Herbrand
sequence of $L$ of $M_p$ is called a complete instance sequence of $M_p$; it is denoted
by $E_{M_p}$, sometimes by $\{A_m\}$ if no confusion can occur.

Suppose that $E_{M_p}$ is the only thing which we know about $M_p$ and $\Gamma$ is
an initial version of a theory (or say an initial guess) to be started. Furthermore
we suppose that the inductive generalization, inductive sufficient condition and
revision rules are the only rules allowed to be used, see (Li 1999). The reason
for introducing the revision rule is that there is no guarantee that an inductive
consequence would never meet a rejection by samples. In this sense an inductive
inference system is rational only if the sequence of versions of theories eventually
converges to the truth of the problem. Thus the rationality of inductive systems
should be expressed by the question: Does there exist a convergent agent such that
for every sequence $\{\Gamma_n\}$ generated by the agent, $\Gamma_1 = \Gamma$ and

$\lim_{n \to \infty} Th(\Gamma_n) = Th(M_p)$?

In what follows we describe such a convergent agent, GUINA, briefly. Our
goal is to obtain all true sentences of $M_p$ by using GUINA, which takes $E_{M_p}$ and
$\Gamma$ as its inputs. An informal description of GUINA is as follows; its
formal definition is in (Li 1999).

Let $\Gamma_1 = \Gamma$. $\Gamma_{n+1}$ will be defined as follows:

1. If $\Gamma_n \vdash A_i$ for some $i$ then $\Gamma_{n+1} = \Gamma_n$;

2. If $\Gamma_n \vdash \neg A_i$, since $A_i$ is a positive sample and it must be
accepted ($\neg A_i$ has met a rejection by facts $A_i$), $\Gamma_{n+1}$ is a maximal
subset of $\Gamma_n$ which is consistent with $A_i$.

3. If neither 1 nor 2 can be done then $\Gamma_{n+1}$ is defined by the induction rules
(see (Li 1999)) as below:

(a) If $A_i = B(t)$ and the inductive generalization rule can be applied
then $\Gamma_{n+1}$ is $\{A_i, \forall x.B(x)\} \cup \Gamma_n$;

(b) Otherwise if $A_i = B$ and there exists $A$ such that $\Gamma_n \vdash A \supset B$
and the inductive sufficient condition rule can be applied for $A$ then $\Gamma_{n+1}$
is $\{A, A \supset B\} \cup \Gamma_n$;

(c) If neither case (a) nor case (b) can be done then $\Gamma_{n+1}$ is
$\{A_i\} \cup \Gamma_n$.

The following theorem shows that GUINA is a convergent agent; the
proof can be found in (Li 1999).

Theorem 3.1. Let $M_p$ be a model for a specific problem $p$, $Th(M_p)$ be
the set of all its true sentences, $E_{M_p}$ be its complete instance sequence and
$\Gamma$ be a theory. If the sequence $\{\Gamma_n\}$ is generated by GUINA from
$E_{M_p}$ and $\Gamma$ then $\{\Gamma_n\}$ is convergent and

$\lim_{n \to \infty} Th(\Gamma_n) = Th(M_p)$.

4 Convergent Agents over Algebraically Closed 
Fields 

As we have seen in Section 3, there is a problem in GUINA, which is as follows.

Given the current theory $\Gamma$ and a statement $A$, how to determine (1)
$\Gamma \vdash A$, or (2) $\Gamma \vdash \neg A$, or (3) $\Gamma \nvdash A$ and
$\Gamma \nvdash \neg A$. For the first order predicate logic there is
no decision algorithm for solving this problem. In order to avoid this situation,
in (Li and Ma 2000a) and (Li and Ma 2000b) we investigated the computational
aspects of convergent agents in Algebraically Closed Fields (ALC). ALC
is a typical complete and decidable theory; in ALC there can be a decision
algorithm to solve this problem. The main reason for the study of convergent agents
over ALC is that it has the following advantages:

(1) It has strong expressive power in the sense that many scientific problems
can be specified in its scope. (2) Since ALC is categorical, i.e. all of its models are
isomorphic, we can choose a specific model, e.g. the complex number field, when
involving semantic approaches; that is, there is no need of unification. (3) In
ALC the formal theories can be syntactically transformed into systems of
polynomial equations, and some metric can be defined to allow convergent
agents to compute analytically in a way similar to numeric computation. Therefore
some symbolic and algebraic computation techniques and numeric computation
techniques developed by (Wu 1986) and (Cucker and Smale 1999) can be used
to compute the limits of theory sequences symbolically and numerically.

5 Concluding Remarks 

Recently, with the rapid development of Internet techniques and applications,
it has become necessary to establish fundamental frameworks to deal with
the running processes and massive unstructured data stored on the networks.
For example, search engines should intelligently suit their users, whose interests
are gradually changing and developing; the semantics of unstructured information
contained in home pages should be made precise in an evolutionary way;
and the knowledge hidden in massive data should be discovered in some
non-monotonic mining processes; and so on. We believe that the solutions of these
problems would rely heavily on powerful analytic concepts and methods, such as
limits, calculus, measures, and gradual expansion. The convergent agents as
a computational framework could provide solutions to these problems. In fact,
introducing convergent agents is to model automated reasoning systems.



Abstract. Many real-life optimization problems such as planning and scheduling
require finding the best allocation of scarce resources among competing 
activities. These problems may be modeled and solved by means of mathematical 
programming. This paper explores a distributed multi-agent approach to 
mathematical programming, and demonstrates the approach in the case of integer 
programming. The important characteristics of the multi-agent approach consist 
in that the behavior-based computation performed by the agents is parallel and 
goal-driven in nature, and has low time complexity. 

Keywords. Multi-agents; Integer programming; Behavior-based computation. 



1 Introduction 

A multi-agent system is a system of interacting heterogeneous agents with different
reactive behaviors and capabilities. Some examples of multi-agent systems include an
artificial-life agent system for solving large-scale constraint satisfaction problems
developed by Han, Liu, and Cao [4], an evolutionary agent system for solving
theorem-proving problems by Yin, Liu, and Li [5], and a reactive behavior-based
image feature extraction system by Liu, Tang, and Cao [6]. As one of the most active
research areas in Distributed Artificial Intelligence, multi-agent approaches have
been shown to have great potential in solving problems that are otherwise difficult to
solve. This is primarily due to the fact that many real-life problems are best modeled
using a set of interacting agents instead of a single agent [1-3]. In particular,
multi-agent modeling makes it possible to cope with natural constraints like the
limited processing power of a single agent and to benefit from the inherent properties
of distributed systems like robustness, redundancy, parallelism, adaptability, and
scalability.

This paper explores a distributed multi-agent approach to mathematical 
programming, and in particular, demonstrates the approach in the case of integer 
programming. The motivation behind this work lies in that many real-life 
optimization problems such as planning and scheduling require finding the best 
allocation of scarce resources among competing activities under certain hard 
constraints. Such problems may be modeled and solved by means of mathematical 
programming. 
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2 Problem Statement 

In what follows, let us first take a look at a typical integer linear-programming 
problem: 

(IP,,) minimize C,Y, 

J 

subject to [— or = or — ]b, , i=l,2,...,m 

j 

and X J >0, Xj integer, j=l,2,...,n. 

where n is the number of variables and m is the number of constraints. 

In order to solve the above problem, generally speaking we can first eliminate the 
integer constraints and obtain a problem as follows: 

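$$
\begin{aligned}
(\overline{IP}_0)\quad \text{minimize}\quad & z = \sum_{j} c_j x_j \\
\text{subject to}\quad & \sum_{j} a_{ij} x_j \;[\le \text{ or } = \text{ or } \ge]\; b_i, \quad i = 1, 2, \ldots, m \\
\text{and}\quad & x_j \ge 0, \quad j = 1, 2, \ldots, n.
\end{aligned}
$$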

We call $\overline{IP}_0$ a relaxation of $IP_0$. Obviously, every feasible solution to $IP_0$ is also a
feasible solution to $\overline{IP}_0$. We can apply a simplex method to solve $\overline{IP}_0$ and thereby
obtain a lower bound on the optimal value of $IP_0$. If the solution happens to contain
only integer components, it is optimal for the original problem. Otherwise, we obtain a
non-integer solution. Let $x_k$ be a non-integer component with value $a$. Suppose that $a_1$
is the maximum integer less than $a$, and $a_2$ is the minimum integer greater than $a$.
Adding $x_k \le a_1$ and $x_k \ge a_2$, respectively, to the constraints of the integer program,
we can construct two new problems. The two new problems are
as follows:

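$$
\begin{aligned}
(IP_1)\quad \text{minimize}\quad & z = \sum_{j} c_j x_j \\
\text{subject to}\quad & \sum_{j} a_{ij} x_j \;[\le \text{ or } = \text{ or } \ge]\; b_i, \quad i = 1, 2, \ldots, m \\
& x_k \le a_1 \\
\text{and}\quad & x_j \ge 0,\ x_j \text{ integer}, \quad j = 1, 2, \ldots, n.
\end{aligned}
$$

$$
\begin{aligned}
(IP_2)\quad \text{minimize}\quad & z = \sum_{j} c_j x_j \\
\text{subject to}\quad & \sum_{j} a_{ij} x_j \;[\le \text{ or } = \text{ or } \ge]\; b_i, \quad i = 1, 2, \ldots, m \\
& x_k \ge a_2 \\
\text{and}\quad & x_j \ge 0,\ x_j \text{ integer}, \quad j = 1, 2, \ldots, n.
\end{aligned}
$$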
The above process of constructing new problems by adding constraints is called
branching. The linear programs that result from branching are called sub-problems of
$IP_0$. In other words, in order to solve $IP_0$, we only need to solve sub-problems $IP_1$ and
$IP_2$. Based on the idea of branching, we have designed a reactive behavior-based,
distributed multi-agent approach to solving an integer-programming problem. An
overview of this approach is given in the following section.
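
The branching step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `branch` and the representation of an added constraint as an `(index, sense, bound)` triple are assumptions made for the example.

```python
import math
from fractions import Fraction


def branch(constraints, k, a):
    """Split a problem on the non-integer component x_k = a.

    `constraints` is a list of (index, sense, bound) triples; this
    representation is hypothetical and chosen only for illustration.
    Returns the constraint lists of the two sub-problems IP_1 and IP_2.
    """
    a1 = math.floor(a)  # maximum integer less than the non-integer a
    a2 = math.ceil(a)   # minimum integer greater than the non-integer a
    ip1 = constraints + [(k, "<=", a1)]
    ip2 = constraints + [(k, ">=", a2)]
    return ip1, ip2


# Branching on x_1 = 18/11 yields the bounds x_1 <= 1 and x_1 >= 2.
ip1, ip2 = branch([], 1, Fraction(18, 11))
```

Because `a` is non-integer, `floor` and `ceil` give exactly the two neighbouring integers used in the text.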



3 An Overview of the Multi-agent System 

The general design of our multi-agent system contains three key elements: a goal of 
the system, agent environments, and a behavioral repository for the distributed agents 
in reproduction, reaction, and communication. We will refer to each of the distributed
agents as G.

Specifically, the reactive behaviors of an agent can be described as follows: 

1. Reproduction 

Each agent G can be reproduced into two agents $G_1$ and $G_2$. They will have the
same goal as G, but not the same environments. We call G a parent agent, and its
reproductions $G_1$ and $G_2$ offspring agents. An agent can be a parent agent and an
offspring agent at the same time.

2. Reaction

An agent G keeps computing and evaluating its current state until
either a goal state has been successfully achieved or a termination condition is
satisfied.

3. Communication 

Let us illustrate agent communication through an example. Suppose that there is
an agent G in environment E. It begins to compute its goal at the current state. If the
result is false, it will start to reproduce and generate offspring agents. Let $G_1$ and $G_2$
be the two reproduced agents. When G reproduces, its environment E is also divided
into two sub-environments $E_1$ and $E_2$, which serve as the environments for $G_1$ and $G_2$,
respectively. Next, $G_1$ and $G_2$ compute their goal in their own environments. If either of
them successfully reaches a goal state, it sends this result to its parent agent G;
otherwise, it continues to reproduce. The parent agent, in turn, decides
whether or not its goal state is achieved based on the results returned by $G_1$ and $G_2$. It
also decides when to stop the reactions of $G_1$ and $G_2$.



4 Multi-agent Integer Programming 

Now, let us consider how the above agent model can be applied to solve an
integer-programming problem. First, we regard the constraints of problem $IP_0$ as the
environment of an agent and the optimal solution of $IP_0$ as its goal. At the first step, G
executes its reaction, i.e., solving the relaxation problem $\overline{IP}_0$ by means of a simplex
method. If G can reach its goal state, that is, the result of its reaction satisfies the
integer constraints, then G stops; otherwise, G begins to reproduce $G_1$ and $G_2$, which
correspond to sub-problems $IP_1$ and $IP_2$. In a similar fashion, the reproduced agents
continue to react and reproduce.
In our implementation, the data structure for each agent is defined as follows:

Define a variable $z$ to store the optimal objective value, and let $z = +\infty$.

1. Goal and environment:

Assign the optimal solution of the integer program to the goal of G and the
constraints to the environment of G (only for the root agent). Define a variable $z_k$ to
store the result of the agent's reaction.

2. Reaction:

A simplex procedure is performed to solve the relaxation problem of the goal, and

(1) If the solution satisfies the integer constraints, then $z_k \leftarrow$ optimal value. If there
is no feasible solution, then $z_k \leftarrow +\infty$. In these two cases, if the agent is the root agent,
then $z = z_k$ and stop; otherwise, send $z_k$ to its parent agent and then stop.

(2) If the solution contains a non-integer component, then the agent sends the result
to its parent agent with a non-integer label (for the root agent, let the parent agent be
itself).

3. Reproduction:

If a solution component $x_k$ is non-integer and its value is $a$, let $a_1$ be the maximum
integer less than $a$ and $a_2$ be the minimum integer greater than $a$; then agent G begins
to reproduce its offspring agents, $G_1$ and $G_2$. Both hold the same goal as G. Their
environments are derived by adding constraints $x_k \le a_1$ and $x_k \ge a_2$ to the environment
of G, respectively.

4. Communication:

Parent agent $G_p$ communicates with its offspring agent $G_k$ according to the
following rule:

Let $z_k$ be the result that $G_k$ sends back to $G_p$.

(1) If $z_k$ is a non-label result, then $z \leftarrow \min(z, z_k)$.

(2) If $z_k$ is a label result, then

if $z_k \ge z$ then stop $G_k$
else request $G_k$ to reproduce.

The process of solving an integer-programming problem begins with the reaction
of a root agent, and stops when no agent can reproduce. At this time, if $z = +\infty$ then
there is no optimal solution for the integer programming problem; otherwise, $z$ is the
optimal value found.
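
The four agent behaviors can be put together in a small, sequential Python sketch. It is an illustration under several assumptions rather than the authors' system: the names `Agent`, `solve_lp`, and `solve_ip` are hypothetical; agents are processed one at a time from a list instead of running in parallel; every constraint is written in the form a·x ≤ b; and, to keep the block self-contained, the simplex method is replaced by vertex enumeration for two-variable problems with a bounded feasible region.

```python
import math
from fractions import Fraction
from itertools import combinations


def solve_lp(c, cons):
    """Minimise c.x subject to a.x <= b for every (a, b) in cons.

    Two-variable stand-in for the simplex method: it enumerates the
    vertices of the feasible region, which is assumed to be bounded.
    Returns (value, point), or None if the region is empty.
    """
    best = None
    for (a, b), (p, q) in combinations(cons, 2):
        det = Fraction(a[0] * p[1] - a[1] * p[0])
        if det == 0:
            continue  # parallel boundary lines never meet in a vertex
        x = ((b * p[1] - a[1] * q) / det, (a[0] * q - b * p[0]) / det)
        if all(r[0] * x[0] + r[1] * x[1] <= s for r, s in cons):
            val = c[0] * x[0] + c[1] * x[1]
            if best is None or val < best[0]:
                best = (val, x)
    return best


class Agent:
    def __init__(self, c, env, parent=None):
        self.c, self.env, self.parent = c, env, parent

    def react(self):
        """Solve the relaxation; return (z_k, point, non_integer_label)."""
        res = solve_lp(self.c, self.env)
        if res is None:
            return math.inf, None, False  # no feasible solution
        val, x = res
        return val, x, any(v.denominator != 1 for v in x)

    def reproduce(self, x):
        """Create two offspring whose environments add x_k <= a1 and x_k >= a2."""
        k = next(i for i, v in enumerate(x) if v.denominator != 1)
        a1, a2 = math.floor(x[k]), math.ceil(x[k])
        unit = tuple(1 if i == k else 0 for i in range(len(x)))
        neg = tuple(-u for u in unit)
        return (Agent(self.c, self.env + [(unit, a1)], self),
                Agent(self.c, self.env + [(neg, -a2)], self))


def solve_ip(c, env):
    z, best_x = math.inf, None
    agents = [Agent(c, env)]          # start with the root agent
    while agents:                     # stop when no agent can reproduce
        g = agents.pop()
        z_k, x, labelled = g.react()
        if not labelled:              # non-label result: z <- min(z, z_k)
            if z_k < z:
                z, best_x = z_k, x
        elif z_k < z:                 # label result with z_k >= z stops the agent
            agents.extend(g.reproduce(x))
    return z, best_x


# Example problem of Section 5, rewritten with <= constraints.
env = [((-1, 1), 2), ((5, 6), 30), ((1, 0), 4), ((-1, 0), 0), ((0, -1), 0)]
z, x = solve_ip((-1, -5), env)        # z = -17 at x = (2, 3)
```

Exact rational arithmetic (`Fraction`) is used so that the integrality test is not disturbed by rounding. The shared variable `z` plays the role of the value that parents update from their offspring's reports.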



4.1 Time Complexity 

We have implemented the above multi-agent integer programming system. We note
that if an integer programming problem is solved with a sequential processing
method, the required time complexity would be $O(n)$, but on the other hand, with our
proposed model, the time complexity would become $O(\log n)$.
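5 An Illustrative Example

In this section, we present a walkthrough example of applying the above-mentioned
approach. The integer program $IP_0$ and its relaxation $\overline{IP}_0$ are:

$$
\begin{aligned}
(IP_0)\quad \text{minimize}\quad & z = -x_1 - 5x_2 \\
\text{subject to}\quad & x_1 - x_2 \ge -2 \\
& 5x_1 + 6x_2 \le 30 \\
& x_1 \le 4 \\
& x_1, x_2 \ge 0 \\
& x_1, x_2 \text{ integers}
\end{aligned}
$$

$$
\begin{aligned}
(\overline{IP}_0)\quad \text{minimize}\quad & z = -x_1 - 5x_2 \\
\text{subject to}\quad & x_1 - x_2 \ge -2 \\
& 5x_1 + 6x_2 \le 30 \\
& x_1 \le 4 \\
& x_1, x_2 \ge 0
\end{aligned}
$$
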

First, we assign the optimal solution and constraints of $IP_0$ to the goal and
environment of agent G. Initially, we set the goal state of G to be false.

Because the goal state is false, G begins to execute its reaction, solving $\overline{IP}_0$ by
means of a simplex method. For the sake of illustration, here we solve $\overline{IP}_0$ by using a
graphical method. As shown in Figure 1, the optimal solution point is A. The solution
components are $x_1 = 18/11$, $x_2 = 40/11$, and the optimal value is $z_k = -218/11$, which is a
lower bound for $IP_0$.
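
As a check on the graphical solution, the stated point A satisfies the first two constraints with equality, so it can be recovered by solving them as a pair of equations (a short verification, not part of the original derivation):

$$
\begin{aligned}
x_1 - x_2 = -2 \;&\Rightarrow\; x_2 = x_1 + 2, \\
5x_1 + 6(x_1 + 2) = 30 \;&\Rightarrow\; 11x_1 = 18 \;\Rightarrow\; x_1 = \tfrac{18}{11},\quad x_2 = \tfrac{40}{11}, \\
z = -x_1 - 5x_2 &= -\tfrac{18}{11} - \tfrac{200}{11} = -\tfrac{218}{11} \approx -19.82 .
\end{aligned}
$$

The remaining constraint $x_1 \le 4$ and the non-negativity constraints hold at this point.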




Fig. 1. The linear programming problem 



Because the solution contains a non-integer component, e.g., $x_1 = 18/11$ is non-integer,
we obtain the two integers 1 and 2, and agent G begins to reproduce agents $G_1$ and $G_2$.
Their goals are the same as that of G, and their environments differ by the added
constraints $x_1 \le 1$ and $x_1 \ge 2$, respectively.

$G_1$ and $G_2$ continue to react to their goals under the supervision of their parent G,
until the goals are reached or the termination condition is satisfied.
6 Conclusion

In this paper, we have described a novel distributed agent approach to solving integer- 
programming problems. The key ideas behind this approach rest on three notions: 
Goal, Environment, and Reactive behavior. Each agent can only sense its local 
environment and applies some behavioral rules for governing its reaction. While 
presenting the agent model, we also provided an illustrative example. 

The advantage of the proposed approach can be summarized as follows: 

1. The reproduction and computation of agents are parallel in nature. 

2. The process of distributed computation is goal-driven. 

3. The time complexity is $O(\log n)$.
Abstract. Intelligent agent technologies hold great promise for the
financial services and investment industries such as portfolio management.
In financial investment, a multi-agent system approach is natural because 
of the multiplicity of information sources and the different expertise that 
must be brought to bear to produce a good recommendation (such as a 
stock buy or sell decision). The agents in a multi-agent system need to 
coordinate, cooperate or communicate with each other to solve a com- 
plex problem. However, ontologies are a key component in how different 
agents in a multi-agent system can communicate effectively, and how 
the knowledge of agents can develop. This paper presents a case study in 
building an ontology in financial investment. The lessons we learned from 
the construction process are discussed. Based on our ontology develop- 
ment experience and the current development of ontologies, a framework 
of next generation ontology construction tools, which is aimed to facili- 
tate the ontology construction, is proposed. 



1 Introduction 

An intelligent agent is an encapsulated computer system that is situated in 
some environment and that is capable of flexible, autonomous action in that 
environment in order to meet its design objectives. Intelligent agent technologies 
hold great promise for the financial services and investment industries such as 
portfolio management. A collection of intelligent agents can be programmed to 
actually enter the Internet and carry out a sequence of instructions, searching 
through a number of sources to locate predetermined information, and making
decisions based on this information.

In financial investment, the tasks are dynamic, distributed, global, and het- 
erogeneous in nature. Take the financial portfolio management as an example, 
the task environment has the following interesting features: (1) the enormous 
amount of continually changing, and generally unorganized, information avail- 
able; (2) the variety of kinds of information that can and should be brought to 
bear on the task (market data, financial report data, technical models, analysts’ 
reports, breaking news, etc.); (3) the many sources of uncertainty and dynamic 
change in the environment. To deal with problems such as portfolio manage- 
ment, a multi-agent system approach is natural because of the multiplicity of 
information sources and the different expertise that must be brought to bear to 
produce a good recommendation (such as a stock buy or sell decision). When
solving a complex problem, the agents in a multi-agent system need to coordi- 
nate, cooperate, and communicate with each other. However, ontologies are a 
key component in how different agents in a multi-agent system can communicate 
effectively, and how the knowledge of agents can develop. 

An ontology is a theory of a particular domain or sphere of knowledge, de- 
scribing the kinds of entity involved in it and the relationships that can hold 
among different entities. An ontology for finance, for example, would provide 
working definitions of concepts like money, banks, and stocks. This knowledge 
is expressed in computer-usable formalisms; for example, an agent for personal 
finances would draw on its finance ontology, as well as knowledge of your par- 
ticular circumstances, to look for appropriate investments. 

Building an ontology for finance was initially motivated by our ongoing
financial investment advisor project. As part of this project, we have built a
multi-agent system called the agent-based soft computing society [1]. In this society,
there are three kinds of agents: problem-solving agents, serving agents, and soft
computing agents. They work together to solve problems in finance, such
as portfolio selection, by using soft computing technologies as well as financial
domain knowledge. A financial ontology is essential for these agents to
communicate effectively.

Although intelligent agent technology holds great promise for the financial
services and investment industries, so far not many papers have been published
or products announced in the financial field. Some typical multi-agent systems
(by no means all) in finance include the Warren System [2][3], the Banker and
Investor Agent System [4], and the Distributed Financial Computing System [5].
No corresponding financial ontologies are used in these multi-agent systems.
There is no doubt that if such a finance ontology existed, the development of
financial multi-agent application systems would be much easier. The same
holds for other application fields.

Interest in ontologies has grown as researchers and system developers have 
become more interested in reusing or sharing knowledge across systems. There 
are some general-purpose upper ontologies such as CYC and Wordnet and some 
domain-specific ontologies that focus on the domains of discourse such as chemi- 
cals ontology and air campaign planning ontology (refer to [6] for an overview of 
the recent development of the field of ontologies in artificial intelligence). Until
now, very few financial ontologies have been reported. In the Larflast project, a
financial domain ontology is under construction and will be used for learning
finance terminology (http://www.linglink.lu/hlt/projects/larflast-inco/ar-99/ar99.html
and [7]). In the I3 (Intelligent Integration of Information, http://dc.isx.com/I3)
project, there is a financial ontology and databases group. They are creating
ontologies of financial knowledge in Loom (a knowledge-representation
language) that describe the contents of existing financial databases. We failed
to find an existing financial ontology that can be (re)used in a multi-agent
environment. The lack of a financial ontology that can be (re)used directly motivated
us to build such an ontology.
In this paper, we will describe the development of the financial ontology
as well as lessons we learned from the developing process. Based on these dis- 
cussions, a framework of the next generation of ontology construction tools is 
proposed. 

The remainder of the paper is organized as follows. Section 2 deals with the 
details of our financial ontology construction. The lessons we learned from the 
development process are discussed in Section 3. In Section 4, a framework of the 
next generation of ontology construction tools is proposed. Finally, Section 5
gives concluding remarks.



2 Construction of Financial Ontology 

We use Ontolingua to construct our financial ontology. Ontolingua is an ontol- 
ogy development environment that provides a suite of ontology authoring and 
translation tools and a library of modular reusable ontologies. For details on 
Ontolingua, visit http://ontolingua.stanford.edu. Ontolingua is based on a top 
ontology that defines terms such as frames, slots, slot values, and facets. When
we build the ontology using Ontolingua, we must define terms such as
portfolio, security, share, stock, bond, etc. by determining the slots and giving the
slot values. Before we can do this, we face a hard knowledge-acquisition
problem. Like knowledge-based-system development, ontology development faces a
knowledge-acquisition bottleneck.

Because we are not experts in the financial domain, we first read some books on
finance to learn the basic concepts of financial investment. We then held
preliminary meetings with financial experts to look for general, not detailed,
knowledge. After this, we studied the documentation very carefully and tried to learn
as much as possible about the domain of expertise (finance). Having obtained 
some basic knowledge, we started by looking for more general knowledge and 
gradually moved down into the particular details for configuring the full ontol- 
ogy. We extracted the set of terms and their relationships, and then defined the 
attributes and their values. At the later stage of knowledge-acquisition, we sub- 
mitted these to financial experts for inspection. During knowledge acquisition, 
we used the following set of knowledge-acquisition techniques in an integrated 
manner[9]: (1) Non-structured interviews with experts to build a preliminary 
draft of the terms, definitions, a concept classification, and so on; (2) Informal 
text analysis to study the main concepts in books and handbooks; (3) Formal 
text analysis. We analyzed the text to extract attributes, natural-language defi- 
nitions, assignation of values to attributes, and so on; (4) Structured interviews 
with the expert to get specific and detailed knowledge about concepts, their 
properties, and their relationships with other concepts; (5) Detailed reviews by 
the expert. In this way, we could get some suggestions and corrections from 
financial experts before coding the knowledge. 

This is a time-consuming and error-prone process. Obviously, more efficient
construction tools are needed. After we acquire the knowledge, we manually




Building an Ontology for Financial Investment 311 



code the knowledge with Ontolingua. Some terms of our financial ontology
written in Ontolingua are as follows:

;;; Securities

(Define-Class Securities (?X) "A term that covers the paper
certificates that are evidence of ownership of bonds, debentures,
notes and shares."

:Def (And (Relation ?X)))

;;; Share

(Define-Class Share (?X) "A unit of equity capital in a company."
:Def (And (Securities ?X)))
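
To make the relationship between the two definitions concrete, the following Python sketch mirrors them as simple frames. It is not Ontolingua and not part of our ontology: the `Frame` class, its fields, and the `is_a` method are hypothetical names introduced only to illustrate that Share is defined as a kind of Securities.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A toy frame: a named term with documentation, parents, and slots."""
    name: str
    doc: str
    parents: list = field(default_factory=list)
    slots: dict = field(default_factory=dict)

    def is_a(self, other):
        """True if this frame is `other` or inherits from it."""
        return other is self or any(p.is_a(other) for p in self.parents)


securities = Frame(
    "Securities",
    "A term that covers the paper certificates that are evidence of "
    "ownership of bonds, debentures, notes and shares.",
)
share = Frame(
    "Share",
    "A unit of equity capital in a company.",
    parents=[securities],  # corresponds to :Def (And (Securities ?X))
)
```

With this structure, `share.is_a(securities)` holds while `securities.is_a(share)` does not, which is the subclass relation the `:Def` clause of Share expresses.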

Currently, you can log in to Ontolingua and check the financial investment
ontology in the unloaded category. To use this ontology, we adopt the Open
Knowledge Base Connectivity (OKBC) protocol [11] as a bridge between the
agents in our financial investment advisor multi-agent system and the financial
ontology. Although we built this ontology mainly for use in multi-agent systems,
any other system can access the ontology through OKBC. This enables the
reuse of this ontology. The ontology constructed by using Ontolingua is in Lisp
format. Before we can access the ontology through OKBC, we must translate the
ontology into OKBC format. This can be accomplished automatically by using
the ontology server.

3 Discussions-Lessons Learned 

By analyzing the ontology construction process described in Section 2 (it is a 
typical procedure followed by most researchers in this area), we can extract 
the following two points: (1) switching directly from knowledge acquisition to 
implementation; (2) manually coding the required knowledge for the domain of 
interest. It is these two points that cause the following disadvantages or problems.
First, the primary current disadvantage in building ontologies is the danger of
developing ad hoc solutions. Usually, the conceptual models describing
ontologies are implicit in the implementation code. Making the conceptual models
explicit usually requires reengineering. Ontological commitments and design
criteria are both implicit and explicit in the ontology code. All these imply that the
built ontologies may contain errors, inconsistencies, etc. Second, domain experts and
human end users have no understanding of formal ontologies codified in
ontology languages. Third, as with traditional knowledge bases, direct coding of
the knowledge-acquisition result is too abrupt a step, especially for a complex
ontology. Finally, ontology developers might have difficulty understanding
implemented ontologies or even building new ontologies. This is because traditional
ontology tools focus too much on implementation issues rather than on design
problems.

The source of these problems is the absence of an explicit and fully
documented conceptual model upon which to formalize the ontology. To this end,




312 Z. Zhang, C. Zhang, and S.S. Ong 



some researchers have proposed ontological engineering [9]. Central to
ontological engineering is the definition and standardization of a life cycle ranging from
requirements specification to maintenance, as well as methodologies and
techniques that drive ontology development. They have developed a framework called
Methontology for specifying ontologies at the knowledge level, and an Ontology
Design Environment (ODE). The knowledge-acquisition result is not coded directly in
the target language, but is represented by a set of intermediate representations. The
knowledge in the intermediate representations can be automatically converted to
Ontolingua code by using ODE. Using Methontology and ODE can alleviate some of
the problems mentioned above. For example, at the later stage of our financial
ontology development, merely following the idea of ontological engineering (without
access to ODE) sped up the ontology construction. To overcome the
difficulties mentioned in [8], more powerful tools or frameworks are still essential.

4 Framework of Next Generation Ontology Construction 
Tools 

Ontology construction is difficult and time consuming and is a major barrier 
to the building of large-scale intelligent systems and software agents. It is clear 
that the creation of easy-to-use tools for creating, evaluating, accessing, using, 
and maintaining reusable ontologies by both individuals and groups is essential. 
Based on our ontology building experience and a relatively profound analysis 
of the current development of ontologies, we propose that the next generation 
ontology construction tools should include the following capabilities: Assemble 
and extend modules from ontology repositories; Adapt and reconcile ontologies; 
Extract and taxonomize terms from other sources; Semi-autonomously synthe- 
size ontologies based on the use of terms in natural language documents; Merge 
overlapping ontologies; Visualize ontologies; Detect inconsistencies; Browse and 
retrieve ontologies; Translate and reformulate. 

Currently, a research group at Stanford University is building ontology
development and use technology that addresses seven (out of nine) of these needs [10].
We can use current visualization technologies to visualize the spatial
relationships, temporal relationships, concept/document associations, complex
aggregate patterns, etc. in ontologies. The theory of logic programming is a
possible solution for the adaptation and reconciliation of ontologies. Hence, our
framework of next generation ontology construction tools is reasonable and realistic.
Tools with such capabilities will facilitate the rapid, accurate development of
a variety of ontologies, and multi-agent systems as well as other knowledge-based
systems will also be better enabled for knowledge sharing and have much better
interaction.

5 Concluding Remarks 

Ontologies play a key role in how different agents in a multi-agent system can 
communicate effectively, and how the knowledge of agents can develop. Until 
now, there are few collections of ontologies in existence; most of them are still
under development. The same is true for ontologies in finance. To this end, 
we started to build a financial ontology used in multi-agent systems. We used 
Ontolingua as our construction tool. Other systems can access this ontology 
through OKBC. Currently, the construction of this financial ontology is still in 
progress. 

Experience with the financial ontology development and a relatively
profound analysis of current ontology research have led us to an extended and
refined set of ideas regarding the next generation of ontology construction tools.
We proposed that these tools should have nine capabilities, including
adapting and reconciling ontologies, visualizing ontologies, and detecting
inconsistencies. A framework with these nine capabilities is reasonable and
realistic. Tools with such capabilities will facilitate the rapid, accurate
development of a variety of ontologies, and multi-agent systems as well as other
knowledge-based systems will also be better enabled for knowledge sharing and have
much better interaction.
Abstract. The original event service of the Common Object Request
Broker Architecture (CORBA) suffers from several weaknesses. Among 
others, it has poor scalability. Previously, we have proposed a framework, 
called SCARCE, which extends the event service to tackle the problem 
in a transparent manner. Scalability is improved through an agent-based 
load balancing algorithm. In this paper, we propose two new features to 
the algorithm that can further enhance its stability and performance. 



1 Introduction 

In many applications, such as video conferencing and Internet radio, the same 
piece of data is being disseminated from a single source to multiple receivers, 
effecting a multicasting model. Honoring this, the Object Management Group 
has proposed the event service [3], which defines a framework for decoupled 
and asynchronous message passing between distributed CORBA objects. In the 
event service, both the senders and the receivers, referred to as the suppliers and 
the consumers respectively, are connected to an event channel, which is actually 
an ordinary CORBA object. A supplier sends out data by invoking the “push” 
method of the event channel object, to be collected by the consumers on the 
other end. 




Fig. 1. The SCARCE framework.
Despite its flexibility, the event service suffers from several shortcomings [4].
Among others, it has poor scalability. Unicasting is often needed to emulate mul- 
ticasting in many existing networking environments. Consequently, the efficiency 
of communication is degraded when the number of consumers and/or suppliers 
increases. To tackle this, we have previously extended the original event service 
into a new framework, known as SCARCE (SCAlable and Reliable Event ser- 
vice) [1] (see Fig. 1). Scalability is improved through the concept of federated 
channels (see Fig. 2). Specifically, event channels are replicated on demand and
interconnected to give a two-level structure, and the total load is shared
among the replicas through an agent-based dynamic load balancing algorithm.



Fig. 2. Federated event channel in SCARCE (figure labels: subordinate event channels, push).



2 Dynamic Load Balancing in SCARCE 

With reference to Fig. 2, each subordinate event channel is associated with an 
agent, known as the channel manager. Upon the creation of every event channel, 
the master server (in Fig. 1) will “broadcast” the object references of the new 
channel and its channel manager to all the existing managers. Periodically, a 
channel manager, say A, will pick up another channel manager B from the list 
of channel managers it knows. If the load of B is smaller than that of A, A will
notify one of its clients to switch to the channel managed by B.
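
The pairwise interaction can be sketched as follows. This is a minimal sketch under explicit assumptions, not the SCARCE code: the class name `ChannelManager` and its methods are hypothetical, the load of a channel is taken to be its number of clients, and a client is assumed to move from the more heavily loaded channel to the more lightly loaded one, which is the direction that reduces the imbalance.

```python
import random


class ChannelManager:
    """Toy channel manager; load is assumed to be the number of clients."""

    def __init__(self, name, clients=()):
        self.name = name
        self.clients = list(clients)
        self.peers = []  # other managers announced by the master server

    @property
    def load(self):
        return len(self.clients)

    def interact(self, rng=random):
        """One periodic interaction; returns the number of clients moved."""
        if not self.peers:
            return 0
        peer = rng.choice(self.peers)          # pick another manager B
        if peer.load < self.load and self.clients:
            peer.clients.append(self.clients.pop())  # one client switches to B
            return 1
        return 0


a = ChannelManager("A", clients=range(5))
b = ChannelManager("B")
a.peers, b.peers = [b], [a]
moved = a.interact()   # A is more loaded, so one client moves to B
```

Note that with this simple rule two managers whose loads differ by one client can keep passing a client back and forth, which is one way the fluctuation discussed below can arise.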

Our initial experience with SCARCE reveals that although the dynamic load
balancing algorithm is effective in most cases, transient overloading may not
be resolved quickly enough, and a high fluctuation of performance is observed
occasionally. In view of this, we propose two extensions to the original algorithm.
First, the frequency at which a channel manager attempts to unload its clients
is adapted dynamically. This makes the system more
responsive to spontaneous overloading. Second, instead of involving
only two agents in each unloading interaction, multiple agents can negotiate
together. This has the advantage that fewer unloading operations will be needed,
thereby reducing negotiation overhead. At the same time, it helps to reduce the
fluctuation in load.
3 Extensions to the Dynamic Load Balancing Algorithm

3.1 Interaction Frequency

It is intuitive enough that with high interaction frequency, the load in the whole 
system will be more balanced, and it takes a shorter time to bring the system 
to a more stable state, upon a transient bursty arrival of clients, or the failure 
of an event channel object. It also takes a shorter period to relieve the load of 
existing channel objects should a new channel be added. The cost paid is the 
high interaction cost between the channel managers. 

We found in our initial experiments that if the frequency of interaction
is too high, the system becomes unstable and fluctuation of load levels is
observed. On the other hand, if the frequency is too low, the system is not
responsive enough when the load level of a channel rises suddenly
(due to factors such as network congestion). Furthermore, whether the frequency
is too high or too low also depends on the load of the system as a whole.
For a lightly loaded system, a shorter interaction period will be preferred, since 
clients can get migrated to a lightly loaded channel within a short period. On 
the other hand, a longer interaction period will be more appropriate in a heavily 
loaded system, since the cost for interaction could be too high, compared with 
the cost saved by migrating clients to lighter-loaded channels. 

In view of this, we propose an adaptive mechanism in SCARCE
to determine the operational interaction frequency, in such a way that the
interaction period can take advantage of the overall system loading. Our rationale is
that the effectiveness of a load balancing interaction is measured by the number
of clients transferred per interaction. The more clients transferred per
interaction, the more effective the interaction is, and the more likely a higher frequency
is to benefit the system performance before reaching the local maximum. Initially,
the interaction period is set to a default parameter $\tau_{\max}$. Let $n$ be the number
of clients transferred in the previous interaction. If $n$ is equal to 0, $\tau_i = \tau_{\max}$.
Otherwise, $\tau_i = \tau_{i-1} \times [\ldots]$, where $w < 1$ is a constant weight. We also restrict
$\tau_i$ to stay within a prescribed system bound, $\tau_i \in [\tau_{\min}, \tau_{\max}]$. So if
$\tau_i < \tau_{\min}$, we will set $\tau_i$ to $\tau_{\min}$.
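
The update rule can be sketched in Python as below. The multiplicative factor in the update is not legible in the text above, so the sketch assumes, purely for illustration, that the previous period is multiplied by w/n; the function name `next_period` and the default parameter values are likewise assumptions.

```python
def next_period(prev, n, w=0.5, t_min=1.0, t_max=60.0):
    """Return the next interaction period (same time unit as `prev`).

    prev : previous interaction period
    n    : number of clients transferred in the previous interaction
    w    : constant weight, w < 1
    The factor w / n is an ASSUMED form of the update, used only to
    illustrate the shrink-then-clamp behaviour described in the text.
    """
    if n == 0:
        return t_max                 # nothing moved: fall back to the default
    period = prev * w / n            # assumed update; more transfers, shorter period
    return min(max(period, t_min), t_max)  # keep the period in [t_min, t_max]


p1 = next_period(60.0, 3)   # 60 * 0.5 / 3 = 10.0
p2 = next_period(1.5, 2)    # 0.375 is below t_min, so it is raised to 1.0
```

Whatever the exact factor is, the structure is the same: reset to the maximum period when no client moved, shorten the period otherwise, and clamp the result to the prescribed bounds.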



3.2 Location Mechanism 

The location mechanism is required to determine the channel or a set of channels 
to take over one or more transferred clients. In the original SCARCE design, 
load negotiation is performed between a pair of channel managers. Furthermore, 
only one client is migrated per interaction at most. In such a case, the initial
negotiation cost is $2r$ and the transfer cost is $r + T$, where $r$ is the roundtrip delay
and $T$ the cost of migrating a client. A simple generalization is to transfer $n > 1$
clients per interaction, thus bringing the cost to $3r + nT$. A natural question
is whether such a pairing is effective. Again generalizing, one can consider
the negotiation of load among $m > 2$ channel managers per interaction. This
gives rise to a more complicated communication structure, but perhaps a more
effective transfer as more clients can be migrated from a heavily loaded channel
within one single interaction to one or more lightly loaded channels. Furthermore, 
with such an arrangement, the interaction frequency can be reduced as well. 

With more than two channel managers, one can naturally adopt an m-to-m
communication structure (everyone being equal), or a ring structure (with
one initiator), or a star structure (with one initiator/coordinator). Intuitively,
the cost with the m-to-m structure is very high, since there are m(m − 1)/2
interaction pairs in total, so the number of socket connections is demanding. With the
ring structure, the cost is mr for initial negotiation and the cost of transfer is at
least r + nT and at most (m − 1)(r + nT), depending on where the transferred
clients finally go. With the star structure, the cost is 2(m − 1)r for initial
negotiation and between r + nT and (m − 1)r + nT for transfer.
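
The ring and star cost expressions can be compared numerically. The following is a minimal sketch that simply evaluates the formulas stated above; the function names and the sample values of r and T are illustrative assumptions, not part of SCARCE.

```python
def ring_costs(m, n, r, T):
    """Ring of m channel managers, n clients moved per interaction.

    Returns (negotiation cost, (minimum transfer cost, maximum transfer cost)),
    where r is the roundtrip delay and T the cost of migrating one client.
    """
    negotiation = m * r
    return negotiation, (r + n * T, (m - 1) * (r + n * T))


def star_costs(m, n, r, T):
    """Star of m channel managers (one of them the coordinator)."""
    negotiation = 2 * (m - 1) * r
    return negotiation, (r + n * T, (m - 1) * r + n * T)
```

For example, with m = 4, n = 1, r = 1.0 and T = 2.0, the ring negotiates more cheaply (4.0 against 6.0) but its worst-case transfer is more expensive (9.0 against 5.0).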

The advantage of the ring configuration is that it is as flexible as the fully 
connected configuration, since all clients can be migrated to any channel along 
the ring, though the number of steps can be more than one. A simple algorithm 
can roughly balance the workload within two rounds of propagation. The initiator
first negotiates with the channel managers along the ring, each of them reporting
its workload. Upon collecting the workload of all members, the initiator (with
sender-initiated algorithms) computes the average load and sends its excess
load to the next member. A member with fewer clients than the average takes
up some extra load to reach the average and passes on the rest. A member
with more clients than the average gives out its excess load.
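
The second round of propagation described above can be sketched as follows. This is an illustrative reading only: it assumes integer client counts, an average rounded down, the initiator at index 0, and that any load still in transit after one lap is handed back to the initiator. None of these details is specified in the text.

```python
def ring_balance(loads):
    """One propagation pass around the ring, starting at the initiator (index 0).

    loads[i] is the number of clients on member i; returns the new loads.
    """
    m = len(loads)
    target = sum(loads) // m      # average load, rounded down (assumption)
    balanced = list(loads)
    carry = 0                     # clients currently being passed along the ring
    for i in range(m):
        available = balanced[i] + carry
        if available > target:
            # Keep the average, pass the excess on to the next member.
            balanced[i] = target
            carry = available - target
        else:
            # Under-loaded member absorbs everything that arrived.
            balanced[i] = available
            carry = 0
    balanced[0] += carry          # leftover returns to the initiator (assumption)
    return balanced
```

With loads [10, 2, 6, 2] the pass yields [5, 5, 5, 5]. With [2, 2, 10, 6] it yields [8, 2, 5, 5], which illustrates why the text speaks of only roughly balancing the workload.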

The star configuration has the advantage of simplicity. The coordinator can 
also compute the average and migrate off the excessive tasks to the selected 
channels directly. However, the coordinator will have a higher workload than
the initiator in the ring configuration.



Fig. 3. Multi-agent negotiation for dynamic load balancing, panels (a) and (b).



In our modification of SCARCE, we adopt the star configuration due to its
simplicity and flexibility. Similar to the original design, at most one client can
be migrated per interaction. Consider the example shown in Fig. 3. Here, the
value of m is equal to 4. Channel manager A acts as the coordinator, and the
other members are channel managers B, C and D. A starts the negotiation
by collecting the load information of B, C and D by invoking the operation
“check_Load()” (see Fig. 3(a)). It then compares the load levels of all channels
involved. Suppose that the channel monitored by B is the most heavily loaded
one whereas the load of the channel monitored by D is the lightest. If the load
of B exceeds that of D by more than a threshold fraction θ,¹ A will notify B to
unload one of its clients to D via the operation “shift_Load(D)” (see Fig. 3(b)).
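
The coordinator's decision in this example can be sketched as below. The text does not say what the threshold fraction is measured against; this sketch assumes it is taken relative to the heavier load, and the function name and dictionary representation are hypothetical.

```python
def star_negotiate(loads, theta):
    """Decide whether one client should be shifted, as the coordinator would.

    loads maps a channel manager name to its load level. Returns a pair
    (source, destination) or None when the imbalance is within the threshold.
    Assumption: the gap is compared with theta times the heavier load.
    """
    heaviest = max(loads, key=loads.get)
    lightest = min(loads, key=loads.get)
    gap = loads[heaviest] - loads[lightest]
    if loads[heaviest] > 0 and gap > theta * loads[heaviest]:
        return heaviest, lightest
    return None
```

For loads {"B": 30, "C": 20, "D": 10} and a 10% threshold the sketch returns ("B", "D"), matching the scenario in which A asks B to unload one client to D.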



4 Experiment 

In this section, we report the result of an initial experiment with the modified 
dynamic load balancing algorithm. Here, n is set to 1, meaning that only one 
client can be transferred per interaction. The value of the parameter ω used for
adapting the interaction period is equal to 0.95, and the values of $T_{max}$ and $T_{min}$
are set to 5 seconds and 1 second, respectively. Load negotiation is performed
among a group of 3 channel managers each time (so the value of m is equal to 
3), which are arranged in a star configuration. 




Fig. 4. Result of the experiment. 



¹ Note that if the value of θ is too small, the system will become unstable, whereas if
it is too large, the responsiveness of the load balancing algorithm will be sacrificed.
In the experiment, θ is set to 10%, which is determined empirically.

There is one data source which pushes one data frame into the master channel
every 250 milliseconds, and the size of each frame is 2K bytes. A total of 180 
clients are used which are divided into groups of 10 clients each. The client 
machines used are SUN UltraSparc-5 machines. Five subordinate event channels 
are pre-created, each running on a separate SUN UltraSparc-5 machine. The 
result is shown in Fig. 4. 

Initially, a burst of 100 clients subscribe to the master server during a very 
short period of time (near point A in Fig. 4). The master server then assigns 
them to the subordinate channels based on their load levels [1]. Precisely, every 
time when a new client arrives, it is assigned to the channel currently with the 
lightest load. However, the load information of the channels can be “old” , in the 
sense that it cannot reflect the actual current loads of the channels accurately. 
The reason is that, since the clients all arrive in a very short period of time, 
after a client is assigned to the channel currently with the lightest load, the 
channel may not have enough time to update its load indicator before the next 
client comes in. The same channel will thus be selected again. Consequently, a 
large proportion of the burst will likely be assigned to the same channel. This is 
commonly called the herd effect [2]. Undoubtedly, an uneven distribution of load
will result. As depicted, this is gradually rectified through dynamic load
balancing. 
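
The herd effect described above can be reproduced with a toy simulation. This is a sketch under simplifying assumptions: the master sees a snapshot of the loads that is refreshed only every `refresh_every` arrivals (a hypothetical parameter standing in for the delay before a channel updates its load indicator), and each client adds one unit of load.

```python
def assign_burst(num_clients, num_channels, refresh_every):
    """Assign a burst of clients to the channel with the lightest reported load.

    The reported loads are a stale snapshot, refreshed only every
    refresh_every arrivals. Returns the actual per-channel client counts.
    """
    actual = [0] * num_channels
    reported = [0] * num_channels
    for i in range(num_clients):
        if i % refresh_every == 0:
            reported = list(actual)   # load indicators catch up
        target = min(range(num_channels), key=lambda c: reported[c])
        actual[target] += 1
    return actual
```

With fresh information (`refresh_every=1`) a burst of 100 clients over 5 channels spreads evenly, 20 per channel. If the snapshot is never refreshed during the burst (`refresh_every=100`), all 100 clients land on the same channel.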

Later, bursts of 50, 30 and 20 clients subscribe to the system near the points 
B, C and D, respectively, jerking the load levels of some of the channels drasti- 
cally as a result. As shown, dynamic load balancing is effective in bringing the 
system back to “equilibrium” in each case. Note that near point E, the
load level of channel 5 is pulled up by a sudden increase in the processing
load of the machine running the channel. This is gradually brought down later.

5 Conclusion 

In this paper, a multi-agent load negotiation algorithm has been proposed for the 
SCARCE framework. Initial experiments reveal that the algorithm is effective in 
enhancing the dynamic load balancing function of SCARCE. We are currently 
looking into the issue of allowing multiple clients to be migrated per interaction, 
thus further improving the performance of the system. 

Abstract. A market of two-dimensional agents with a geographical constraint
is modeled with the soap froth analogy, and numerical simulations
have been performed using a cellular network generated by Voronoi tessellation.
By tuning the noise parameter and the ratio of relative interaction
between agents, different phases corresponding to colonization
by monopolies as well as a fair market are observed. From the analysis of
the distribution of minority agents surviving in the system, we find that
some of the smaller cells can be occupied by the minority agents, while
larger cells are taken over by the surrounding majority agents.



1 Multi-Agent Systems

A Multi-Agent System (MAS) is a kind of Distributed Artificial Intelligence (DAI)
system in which each agent is capable of automatically altering its internal states in order
to achieve a global objective under the influence of a changing environment. In this
paper, we describe a system of agents, each belonging to one of two
groups, initially distributed at random on a two-dimensional trivalent cellular
network generated by Voronoi tessellation of the plane. Each agent belongs to a
particular cell of n edges, and interacts with its n nearest neighbors such that
agents cooperate if they belong to the same group, and otherwise compete. A
typical example is chain-store companies, such as banks, fast-food shops and
supermarkets, which try to achieve dominance in market share.

2 Soap Froth Analogy 

In many studies of social phenomena, cellular automata are often used in numerical
work to test certain models of social behavior [1]. One of the drawbacks of
this technique is that all simulations are performed on a regular lattice, which is
hardly the case in a real business environment. Moreover, the neighborhood of each
site in cellular automata is fixed and cannot be altered to accommodate
other configurations. For instance, inside a distributed sales market, agent companies
are in general not distributed on a regular grid, and the number of
neighbors with which each agent interacts can differ. Thus, our attempt in using
the soap froth analogy is to overcome this critical shortcoming of the cellular
automata approach in dealing with multi-agent systems.

We earlier introduced a new simulation model, the multi-soap-agent system [2],
which is mainly based on topological relationships and maximization arguments.
By relating the physical properties of soap froth to their topological
relationships [3-5] using the shell model, a reasonable configuration of agents
is constructed so that the problems of stability, scalability and fairness can be
examined. Normally, a multi-agent system globally approaches a quasi-equilibrium
state provided that there are no external stimuli to upset the system,
although locally each agent is still influenced by the interactions with its neighbors.
The advantages of our multi-agent model are that it provides an irregular
lattice without fixing the number of neighbors and that the interactions are
localized.



2.1 Agent Type 

Consider a multi-agent system consisting of different agent groups, with agents 
in each group distributed over a subset of the given two-dimensional cellular 
network. Based on their resources, agents provide services to customers who
are located on the two-dimensional plane. Owing to the different strategies adopted
by each agent group, customers have different degrees of affinity for a given
agent. Geographical information also affects the choice of the customers: if
an agent is situated near a customer, the customer tends to request services
from that nearby agent. When an agent loses customers, its domain of
influence shrinks, corresponding to the shrinkage of a bubble in a
soap froth. Indeed, when all of its business is taken over by its neighbors, meaning
that no customer prefers to demand service from that agent,
the agent will eventually disappear from the market. Inside the system, there is
a certain probability of a new agent emerging. Generally, a new
agent buds at the boundary among agents because the customers in that region
are equally far from the old agents, or equally poorly served by the old
agents.

This kind of multi-agent systems can be easily mapped into a cellular network 
model using the soap froth analogy. Each bubble is occupied by one agent and 
the number of neighbors with which this agent interacts is the number of edges of
the bubble. The area of a bubble corresponds to the resources of the agent. The
perimeter of the bubble represents the number of customers for which this agent must
compete with its n neighboring bubbles for dominance of the market, since the
points on the perimeter are of equal distance to the centers of the n + 1 bubbles.
According to the diffusion processes and von Neumann’s theory [6] describing
these processes in soap froth, the area of the bubble will evolve according to the
following equation.



$$\frac{dA_\ell}{dt} = k(\ell - 6) \qquad (1)$$
where k is a positive constant related to the physical properties of the cell and $A_\ell$
is the area of a cell with $\ell$ edges. The system gradually approaches an
asymptotically stable state. Throughout the evolution, two quantities, the average area
of a bubble with $\ell$ sides and the average perimeter of a bubble
with $\ell$ sides, are found to obey scaling laws [7, 8]. This corresponds
to the fact that the demand of resources and the supply of services are strongly
correlated to the number of neighbors surrounding the agent.
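
As a small illustration of the growth law dA/dt = k(ℓ − 6): cells with fewer than six edges shrink, six-edged cells are stationary, and cells with more than six edges grow. The sketch below integrates the law with a simple forward Euler step for a single cell whose edge count is held fixed; the step size, the step count and the clamping of the area at zero are assumptions made for illustration.

```python
def area_rate(edges, k=1.0):
    """Rate of change of the area of a cell with the given number of edges."""
    return k * (edges - 6)


def evolve_area(area, edges, k=1.0, dt=0.01, steps=100):
    """Forward-Euler integration of the area, with the edge count held fixed.

    The area is clamped at zero, which corresponds to the cell (agent)
    disappearing from the froth (market).
    """
    for _ in range(steps):
        area = max(0.0, area + area_rate(edges, k) * dt)
    return area
```

Starting from unit area, a four-edged cell vanishes within about 50 steps at these settings, while an eight-edged cell triples its area over 100 steps.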

2.2 Model Construction 

For simplicity, this paper only considers the quasi-static model. The topology
of the distribution of agents is fixed, without the birth and death processes
of cells. The configuration of the pattern is based on a topology calculated by
Voronoi tessellation [9]. First, a set of points is randomly placed in an area. If
each point isotropically emits rays at the same speed, then the locus of meeting
points of the rays forms the boundary of the cells. Mathematically, the Voronoi
construction can be expressed as

$$\mathrm{Vor}(p) = \{\, x \in \mathbb{R}^2 \mid \|x - p\| \le \|x - p'\|,\ \forall p' \in S \setminus \{p\} \,\} \qquad (2)$$

Under Voronoi tessellation, all the local customers are bounded by the cell of
the nearest agent. In practice, the Voronoi tessellation is somewhat harder
to compute directly, so we use the dual graph of the Voronoi diagram, the Delaunay
triangulation. The Delaunay triangulation of a point set is a
set of triangles such that no other point of the set lies inside the
circumcircle of the three vertices of any triangle.



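$$\mathrm{Del}(p_i p_j p_k) = \{\, \triangle p_i p_j p_k \mid \exists c \in \mathbb{R}^2,\ \|p - c\| > \|p_\mu - c\|,\ \forall p \in S \setminus \{p_\mu\},\ \mu = i, j, k \,\} \qquad (3)$$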

The face of the Delaunay triangle (i.e. the center of the circumcircle) corresponds 
to the vertex of the Voronoi cell. The vertex of the Delaunay triangle is mapped 
to the face of the Voronoi cell. The edges of the Voronoi cells are just the perpendicular
bisectors of the edges of the Delaunay triangles. Hence a configuration
of two-dimensional trivalent cellular structures is constructed as shown in
Figure 1.
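
The two definitions above can be checked directly on small point sets. The following sketch is a brute-force illustration, not the construction used in the paper: `nearest_site` assigns a customer location to the agent whose Voronoi cell contains it, and `is_delaunay` tests the empty-circumcircle property of one candidate triangle. All function names are hypothetical.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points in the plane."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def circumcircle(a, b, c):
    """Return (center, squared radius) of the circle through a, b and c."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0:
        raise ValueError("the three points are collinear")
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    center = (ux, uy)
    return center, dist2(a, center)


def is_delaunay(triangle, sites):
    """True if no other site lies strictly inside the triangle's circumcircle."""
    center, r2 = circumcircle(*triangle)
    return all(dist2(p, center) >= r2 - 1e-12
               for p in sites if p not in triangle)


def nearest_site(x, sites):
    """The site whose Voronoi cell contains x (ties broken by list order)."""
    return min(sites, key=lambda p: dist2(x, p))
```

The circumcenter returned by `circumcircle` is equidistant from the three triangle vertices, which is why it coincides with a vertex of the Voronoi diagram when the triangle is Delaunay.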

3 Simulations 

To identify the group to which an agent belongs, we label each agent with a
particular color. Agents of different colors behave differently according to the
strategy played by their group. Agents belonging to the same group cooperate
with each other in order to compete against agents from different groups.

Fig. 1. Voronoi Diagrams & Delaunay Triangulations

In our simulation, we are concerned with two agent groups only. One group
is colored red and the other blue. Initially, the agents of both colors are randomly
assigned to the cells. The interaction of neighboring agents results in a dynamics
of color switching. With the analogy of soap froth, the energies (or business
clout) of an agent are defined as



$$E_{same} \propto \alpha \sigma \mathcal{L}_{same}$$
$$E_{diff} \propto \gamma \sigma \mathcal{L}_{diff} \qquad (4)$$

where $E_{same}$ and $E_{diff}$ are the interaction energies of the boundary shared by the same
and different color agents, respectively, $\mathcal{L}_{same}$ and $\mathcal{L}_{diff}$ are the lengths of the
boundary shared by the same and different color agents, $\sigma$ is the intrinsic strength of
the agent (i.e. the surface tension of real soap froth), and $\alpha$ and $\gamma$ are the strengths of
the bonding energy of the same and different colors. We define a relative strength
parameter of different color interaction to same color interaction as $x = \gamma/\alpha$.
The energy of an agent i is then equal to the difference of the weighted same
color interaction energies and the weighted different color interaction energies,

$$E_i = m_{same} E_{same} - m_{diff} E_{diff} \qquad (5)$$

where $m_{same}$ and $m_{diff}$ are the numbers of neighboring agents belonging to the
same and different groups.
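
A worked instance of the agent energy may help. The sketch below assumes the reading of Equations (4) and (5) in which each interaction energy is a bonding strength times the intrinsic strength times the shared boundary length, and it takes the unstated proportionality constant to be 1. The argument names are illustrative.

```python
def agent_energy(m_same, m_diff, len_same, len_diff, alpha, gamma, sigma=1.0):
    """Energy of one agent from its same-color and different-color neighbors.

    m_same, m_diff: numbers of neighbors in the same / a different group.
    len_same, len_diff: boundary lengths shared with each kind of neighbor.
    alpha, gamma: bonding strengths; sigma: intrinsic strength of the agent.
    Assumption: the proportionality constants in the energies are 1.
    """
    e_same = alpha * sigma * len_same
    e_diff = gamma * sigma * len_diff
    return m_same * e_same - m_diff * e_diff
```

For an agent with four same-color and two different-color neighbors, boundary lengths 3.0 and 1.5, and a relative strength x = γ/α = 2, the energy is 4·3.0 − 2·3.0 = 6.0.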

Let $P_i$ be the probability that customers originally belonging to agent i
decide to switch to an agent of a different color, and $Q_i = 1 - P_i$ be the probability
that customers will not switch their affiliation. The color pattern on the
cellular network evolves according to the Boltzmann distribution of switching
probability as

[Equation (6): expression not recoverable here; it involves a sum over the neighbor set $M_i$]

where $M_i$ is the set of neighboring agents of agent i and $\beta$ is the noise factor.



4 Results 

Four hundred points are randomly generated on a square region of the two-dimensional
plane. Using the Delaunay triangulation described above, we can compute the
Voronoi tessellation. To avoid boundary and finite-size effects, the square region
is cut into four quadrants, which are then duplicated on complementary sides.
Based on the switching dynamics, the system undergoes 1000 Monte Carlo steps.
From our previous results, the system attains different phases [10] with the
variation of the noise factor $\beta$ and the relative interaction strength ratio x. Inside
the phase diagram, different phases of colonization by monopolies and fair
market are found to exist in the multi-agent system.

We now examine the configuration of the distribution of agents in detail. We observe
that when color dominance appears, the majority agents (those with the
dominant color) tend to coalesce to form a larger cluster. Minority agents
(those with a color different from the dominant color) can still survive between the
large clusters inside the system. Indeed, we observe that the probability that agents
in decagons have the dominant color is much higher than in squares, which are
smaller. Figure 2 shows that majority agents mainly dominate the
large cells while most minority agents survive in the smaller cells.



Fig. 2. Minority agents are present in the smaller cells (400 Voronoi cells with partial periodic boundary condition)



5 Discussions 

This phenomenon can be explained by considering the model of the multi-agent
system. The driving force of an agent acquiring business in the market is the
minimization of cost and the maximization of profit. By Equation 4,
the interaction energy is proportional to the length of the boundary shared with
same-color and different-color bubbles. This implies that the cost is correlated with the
number of customers. The more customers open for competition, the larger the interaction
energy between the agents. Therefore majority agents have a larger energy in
bigger cells with which to persuade the customers on the perimeter to change their affiliation,
thereby switching the cell to the dominant color.

On the other hand, for small cells, the perimeter is also small, implying that
the smaller number of customers does not appear sufficiently attractive for the
majority agents to spend resources on them. This provides an opportunity
for the minority agents to survive in the smaller cells. These results elucidate
why small companies are still able to operate in the market even under
the strong influence of big companies. However, if a company does not have
sufficient resources to compete against others, especially the big companies, it is
hard for it to survive in the major market.

Abstract. Watkins’ Q-learning is the most popular and an effective
model-free method. However, compared with model-based approaches, Q-learning
with various exploration strategies requires a large number of trial-and-error
interactions to find an optimal policy. To overcome this drawback,
we propose a new model-based learning method extending Q-learning.
This method has separate EI and ER functions for learning
exploitation-based and exploration-based models, respectively. The EI function,
based on statistics, indicates the best action. The ER function,
based on the information of exploration, leads the learner to largely unexplored
regions of the global state space by backing up in each step.
We then introduce a new criterion as the information of exploration.
By combining these functions, we can effectively carry out exploitation
and exploration strategies and can select an action that takes both
strategies into account simultaneously.



1 Introduction 

Reinforcement learning is an effective form of learning in an unknown environment,
where a supervisor cannot support the learner. The learner learns an optimal
behavior through trial-and-error interactions with a dynamic environment. In a
reinforcement learning problem, each time the learner performs an action in its
environment, a trainer may provide a reward or penalty to indicate the desirability
of the resulting state [1]. The learner is told only the reward, but is not
told whether or not the selected action is the best. This means that the learner
must explicitly explore its environment. So, the learner must balance
exploitation and exploration.

Many reinforcement learning algorithms have been proposed for Markov Decision
Process (MDP) environments. These algorithms can be classified into two
kinds, model-free methods and model-based methods. Model-free methods learn a
policy or value function without explicitly representing a model of the controlled
system. Model-based methods learn an explicit model of the system simultaneously
with a value function and policy [2]. Atkeson et al. [2] compared these
two methods according to two measures, data efficiency and computational efficiency, and
showed that model-based methods are more efficient than model-free methods.

Q-learning [3] is the most popular and an effective model-free method. However,
compared with model-based approaches, Q-learning with various exploration strategies
[4], [5], [6] needs a large number of trial-and-error interactions to find an
optimal policy. To overcome this drawback, we propose a new model-based learning
method extending Q-learning. This method has separate EI and ER functions
for learning exploitation-based and exploration-based models, respectively.
The EI function, based on statistics, indicates the best action. The ER function,
based on the information of exploration, leads the learner to largely unexplored
regions of the global state space by backing up in each step. We then introduce a
new criterion as the information of exploration. By combining these functions,
we can effectively carry out exploitation and exploration strategies and can select
an action that takes both strategies into account simultaneously.

2 Q-learning 

In this paper, we consider only MDP environments in which the state
set S and the action set A are finite. In an MDP, the transition probability p and the
expected reward r depend only on the current state and action, not on earlier states
or actions [1].

Watkins’ Q-learning [3] is the most popular and an effective model-free
method. In Q-learning, the learner estimates and evaluates the following
Q-value from its experiences $(s_t, a_t, s_{t+1}, r_{t+1})$:

$$Q(s_t, a_t) \leftarrow (1 - \alpha) Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a' \in A} Q(s_{t+1}, a') \right) \qquad (1)$$

where $\alpha$ ($0 < \alpha < 1$) denotes a learning rate and $\gamma$ ($0 < \gamma < 1$) is a discount
factor which controls the balance between immediate and future rewards.
Each Q-value will eventually converge to the true Q-value $Q^*$ for every state s
and action a [3]. Action selection is based on the current Q-values. Two selection
methods that control the trade-off between exploitation and exploration are well known [4], [5], [6].

ε-greedy With probability ε, an action is selected randomly. Otherwise,
the best action, which has the largest Q-value, is selected with probability 1 − ε.
Boltzmann exploration The probability $p(a \mid s)$ of taking action a in state s
is defined as follows:

$$p(a \mid s) = \frac{\exp(Q(s, a)/T)}{\sum_{a' \in A} \exp(Q(s, a')/T)} \qquad (2)$$

where T is a temperature parameter which can be decreased over time to
decrease exploration.
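
The update rule and the two selection methods of this section can be written down compactly. The following is a minimal tabular sketch, assuming Q-values stored in a dictionary keyed by (state, action) with unseen entries treated as 0; the function names and this storage choice are illustrative, not taken from the paper.

```python
import math
import random


def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning backup: Q <- (1 - alpha) Q + alpha (r + gamma max Q')."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q[(s, a)]


def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))


def boltzmann_probs(Q, s, actions, temperature):
    """Boltzmann (softmax) action probabilities at the given temperature."""
    prefs = [Q.get((s, a), 0.0) / temperature for a in actions]
    top = max(prefs)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(p - top) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]
```

Subtracting the largest preference before exponentiating leaves the probabilities unchanged and is only a numerical safeguard for small temperatures.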



3 EI and ER Combined Learning

In this section, we explain our combined exploitation-based and exploration-based
learning (combined learning) algorithm, which extends the Q-learning above. In
our learning, we adopt a model-based approach using EI and ER functions. Let
us start by describing how to build the exploitation-based and exploration-based
models by formulating the EI and ER functions, respectively. Then, we will describe the
combined function $Q_{comb}$ of these functions. Furthermore, we will discuss the
weight parameter which connects the EI and ER functions.

3.1 EI and ER functions

First, the EI function represents the mean of the expected sum of rewards which
the learner will receive if it executes the optimal policy from state s. We
define it recursively as

$$\chi_{EI}(s_t, a_t) \leftarrow (1 - \alpha) \chi_{EI}(s_t, a_t) + \alpha \max_{a \in A} EI(s_{t+1}, a) \qquad (3)$$
$$EI(s_t, a_t) \leftarrow \bar{r}(s_t, a_t) + \gamma \chi_{EI}(s_t, a_t) \qquad (4)$$

where $\bar{r}(s, a)$ is the weighted mean of the reward. Recall that $\alpha$ and $\gamma$ denote the
learning rate and the discount factor, respectively. The approximate EI function
approaches the true EI function $EI^*$ asymptotically over many episodes.

Second, the ER function denotes the expected worth of taking exploratory actions
for the current transition and subsequent ones. We express it recursively as in the
equations below.

$$\chi_{ER}(s_t, a_t) \leftarrow (1 - \alpha) \chi_{ER}(s_t, a_t) + \alpha \sum_{a \in A} ER(s_{t+1}, a) \qquad (5)$$
$$ER(s_t, a_t) \leftarrow e(s_t, a_t) + \gamma \chi_{ER}(s_t, a_t) \qquad (6)$$

The exploration-based information $e(s, a)$ is an important factor in realizing efficient
exploration. In the next section, we introduce a new criterion in detail.

In each step, to calculate the above EI and ER functions, the learner needs to
update some adequate statistics as follows:

$$\sum r(s_t, a_t) \leftarrow (1 - \alpha) \sum r(s_t, a_t) + r_{t+1} \qquad (7)$$
$$\sum r^2(s_t, a_t) \leftarrow (1 - \alpha) \sum r^2(s_t, a_t) + r_{t+1}^2 \qquad (8)$$

Using these statistics, we can calculate the weighted mean $\bar{r}(s, a)$ of the reward and
the weighted variance $\sigma^2(s, a)$ of the reward as follows [7]:

$$\bar{r}(s, a) = \frac{\sum r(s, a)}{n(s, a)} \qquad (9)$$
$$\sigma^2(s, a) = \frac{\sum r^2(s, a)}{n(s, a)} - (\bar{r}(s, a))^2 \qquad (10)$$

where the normalizer $n(s, a)$ is derived from the number of times action a has been selected
in state s. Thus the EI and ER functions, which carry different information,
are learned separately. Over many episodes, the EI and ER information propagates
from the goal to the start. This means that the EI and ER functions lead to the best
policy and to largely unexplored regions of the global state space, respectively.
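
The running statistics for one state-action pair can be sketched as follows. The exact definition of the normalizer is not legible in the text, so this sketch assumes it is the sum of the same decaying weights applied to the rewards (updated by the same recursion with 1 in place of the reward). The class name and that normalizer are assumptions.

```python
class RewardStats:
    """Exponentially weighted reward statistics for one (state, action) pair."""

    def __init__(self):
        self.sum_r = 0.0    # weighted sum of rewards
        self.sum_r2 = 0.0   # weighted sum of squared rewards
        self.weight = 0.0   # assumed normalizer: sum of the decaying weights

    def update(self, reward, alpha):
        """Decay the old statistics by (1 - alpha) and add the new reward."""
        self.sum_r = (1 - alpha) * self.sum_r + reward
        self.sum_r2 = (1 - alpha) * self.sum_r2 + reward * reward
        self.weight = (1 - alpha) * self.weight + 1.0

    def mean(self):
        """Weighted mean of the reward."""
        return self.sum_r / self.weight

    def variance(self):
        """Weighted variance of the reward (clamped at zero against rounding)."""
        return max(0.0, self.sum_r2 / self.weight - self.mean() ** 2)
```

With alpha = 0 the statistics reduce to the ordinary sample mean and (biased) sample variance, which gives a convenient sanity check.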
3.2 Exploration-based information

We have now discussed how the ER function works. In the ER function, the exploration-based
information $e(s, a)$ plays an important role in realizing efficient exploration.

Suppose the stochastic reward $r(s, a)$ follows a normal distribution. Then the weighted
variance of the maximum likelihood estimate $\bar{r}(s, a)$ satisfies the following bound by the
Cramér-Rao inequality.

$$V[\bar{r}(s, a)] \ge \frac{\sigma^2(s, a)}{n(s, a)} \qquad (11)$$

We use its minimum as the exploration-based information, that is,

$$e(s, a) = \frac{\sigma(s, a)}{\sqrt{n(s, a)}} \qquad (12)$$



3.3 Combined Function

Instead of the Q-value in Q-learning, we use the following combined function
$Q_{comb}(s, a)$ of the EI and ER functions for evaluating and planning.

$$Q_{comb}(s, a) = EI(s, a) + \omega(s, a) ER(s, a) \qquad (13)$$

where $\omega(s, a)$ is the weight parameter which controls the ratio between exploitation
and exploration. In this way, we consider exploitation-based and exploration-based
information simultaneously.

3.4 Control Weight 

In this section, we take a close look at the weight parameter. Suppose the
learner reaches a state s. To encourage behavior that tests long-untried actions,
we use Sutton’s “bonus” concept [5] as $\omega(s, a)$. If the action a has not been tried
for m steps, we define the weight parameter $\omega(s, a)$ as

$$\omega(s, a) = \kappa \sqrt{m} \qquad (14)$$

where $\kappa$ is a small parameter.
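
Putting the pieces of this section together, action selection with the combined function can be sketched as below. The sketch reads the exploration-based information as e = σ/√n, the control weight as ω = κ√m and the combined value as EI + ω·ER; the dictionaries holding per-action EI values, ER values and untried-step counts are hypothetical containers chosen for illustration.

```python
import math


def exploration_info(sigma, n):
    """Exploration-based information: standard deviation over sqrt of the count."""
    return sigma / math.sqrt(n)


def control_weight(kappa, m):
    """Bonus-style weight for an action that has not been tried for m steps."""
    return kappa * math.sqrt(m)


def q_comb(ei, er, kappa, m):
    """Combined value: exploitation term plus weighted exploration term."""
    return ei + control_weight(kappa, m) * er


def select_action(ei, er, steps_untried, kappa):
    """Pick the action with the largest combined value in the current state.

    ei, er and steps_untried map each action to its EI value, its ER value
    and the number of steps since it was last tried.
    """
    return max(ei, key=lambda a: q_comb(ei[a], er[a], kappa, steps_untried[a]))
```

With κ = 0 the rule reduces to greedy selection on EI alone; a larger κ lets a long-untried action with a high ER value overtake the greedy choice.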



4 Experiments 

We examined the performance of our learning method experimentally and
compared it with the Q-learning above and Sutton’s Dyna-Q+ [5]. For a simple comparison,
we employed ε-greedy selection for all these algorithms. In each algorithm, ε is
decayed as a function of log(2 + episode), and the discount factor γ was set to 0.95. We set
the other parameters of each algorithm as follows.

Q-learning The learning rate α is decayed by $\alpha \leftarrow 1/\log_2(2 + \mathrm{episode})$.
Dyna-Q+ The learning rate α is decayed by $\alpha \leftarrow 1/\log_2(2 + \mathrm{episode})$. The number of
iterations N was set to 5. The bonus reward parameter κ was set to 0.2.
Combined learning The learning rate α is decayed by $\alpha \leftarrow 1/\log_2(2 + \mathrm{episode})$.
The parameter κ was set to 7.5.

To compare the performance of each algorithm in delayed-reward environments, we tested these
algorithms on Sutton’s environment [8] (Task1), shown in Figure 1, and a tree-type
environment (Task2), shown in Figure 2. These environments differ from each other
in their delayed-reward characteristics. In each figure, circles represent the states of the
environment, narrow arrows are state transitions, the upper number on a narrow arrow
represents the action, and the lower number represents the probability
of the state transition. Wide arrows with the number 10 represent a reward.

5 Results 

Let us compare the performance of combined learning with that of the other algorithms.
Result 1 (Figure 3) shows the average reward received over a sequence of 300
steps, averaged over 50 runs. Result 2 (Figure 4) shows the average reward
received over a sequence of 600 steps, averaged over 50 runs. The X-axis shows the
number of steps and the Y-axis shows the average reward per step. Both results
show the superiority of combined learning: in both cases,
combined learning is much more efficient than the other algorithms in a delayed-reward
environment.



Fig. 1. Sutton’s Environment [8]

Fig. 2. Tree-type Environment

Fig. 3. Performance for Task1

Fig. 4. Performance for Task2

6 Conclusions

We proposed a new model-based approach extending Q-learning. This method
has EI and ER functions for evaluating exploitation-based and exploration-based
information, respectively. These functions are learned separately by backing up
in each step. By combining these functions, we can effectively carry out exploitation
and exploration strategies and can take an action that considers both kinds of
information simultaneously. We tested the performance of our learning method, comparing
it with the other learning algorithms in two environments. As shown in each result,
our learning method was superior.

In summary, our findings for EI and ER combined learning are as follows. First,
by backing up the exploitation and exploration information, combined learning
effectively leads the learner to the best action in the exploitation strategy and to largely
unexplored regions of the global state space in the exploration strategy. This means that
combined learning efficiently carries out exploitation and exploration strategies.
Second, we can consider exploitation and exploration information simultaneously.
Third, this algorithm is relatively easy to implement. Therefore, it is
flexible and useful in some applications.

Abstract. Artificial life simulations of social situations are a relatively
new field which aims to model situations which are too complex to be
investigated analytically. In this paper, we develop commuter-agents with
simple probabilistic models of the world and show that such agents can
develop cooperation which aids the society as a whole. We show that
there are situations in which the more powerful agents are sometimes
forced by their greater knowledge into taking a lower utility than the
weaker ones. In the last series of experiments we show that agents which
have the ability to predict others’ road usage can materially improve the
utility of the population as a whole.



1 Introduction 

Wang and Fyfe [2] suggested an interesting N-player game based upon a simple 
simulation of traffic flow in a city, where players decide whether to take the car 
or bus every morning with different payoffs depending upon their opponent’s 
choices. They used an evolutionary approach and showed that cooperation in 
the society as a whole could evolve to the extent that an individually selfish 
population evolved to the Nash equilibrium point where no individual could 
improve his utility by changing his mode of transport within the current pop- 
ulation. Here, we look at an alternative probabilistic approach to the problem 
using probabilistic agents to represent the commuters in the game. 

Firstly, we expect our agents to behave rationally, that is they always choose 
the action that maximises their expected utility based upon the information 
available to them. Each agent also has a view of the world based on his mem- 
ory of his previous experiences in the world. Each individual can calculate the 
probability that there will be i car users in the rest of the population from 
$P(\text{other cars} = i) = c / ML$, where c is the number of times he remembers i other 
users to have used the car in his memory space of ML memories. Using the maximum 
likelihood number of cars, he can then calculate his utility if he should 
take the car or the bus. Each agent begins with a prior probability for each i 
equal to 

If we constrain every agent to use the same model and initialisation, we would 
expect that each agent would always take the same decision as every other agent 
in every round, as they are all making rational decisions based upon identical 
evidence. To bring about individuality, we need to introduce differing design 
aspects in the agents, such as those suggested below: 

Different histories - players could be allowed to have different memory lengths. 
The players with longer memories may be able to assign more accurate prob- 
abilities to events and therefore calculate more accurate expected utilities. 
Additional information - this is one of the most common aspects of prob- 
abilistic game theory. Some players will have a more complex game repre- 
sentation than other players i.e. players may model additional events and 
therefore have a different level of evidence compared to the other players. 

We have previously investigated imperfect information in the Iterated Pris- 
oner’s Dilemma [1] which may be considered an abstraction of such social prob- 
lems as traffic flow. In this paper, we shall examine the effect of the different 
modelling attributes in a series of experiments and show that, counterintuitively, 
it is not always the agent with greatest processing power who gains most from 
social interactions. 

2 Experimental Investigation - History Lengths 

Initially, we define 10 identical agents to take part in the game, all having a 
memory length of 1 - that is, they may only remember the result of the single 
previous round. Each agent uses the same utility evaluation functions: 

U(car) = 100-(10*m), and U(bus) = 60-(4*m) 

where m represents the number of agents in the population who use cars. 

The maximum utility for the entire population exists when every player takes 
the bus - however, each player will only take the decision that maximises their 
own personal utility. Each agent can maximise their personal utility by taking 
the car if there are 6 or fewer other car users. It is easily shown that the Nash 
Equilibrium point for this utility function is around 7 car users; at this point it 
does not pay a bus user to switch to car use nor a car user to switch to bus use. 
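
The equilibrium claim can be checked by brute force. The sketch below is not the authors' code. It assumes that m in the utility functions counts the other car users seen by the deciding agent, an interpretation that reproduces the figures quoted in the text (car preferred with 6 or fewer other car users, equilibrium at 7 car users). The function name nash_car_counts is hypothetical.

```python
def u_car(m):
    """Utility of driving when m other agents drive (assumed reading of m)."""
    return 100 - 10 * m


def u_bus(m):
    """Utility of taking the bus when m other agents drive."""
    return 60 - 4 * m


def nash_car_counts(n_agents, car_payoff, bus_payoff):
    """Return every total number of car users at which no single agent
    gains by switching mode of transport."""
    stable = []
    for cars in range(n_agents + 1):
        # A driver sees cars - 1 other drivers, whichever mode he picks.
        driver_stays = cars == 0 or car_payoff(cars - 1) >= bus_payoff(cars - 1)
        # A bus rider sees all `cars` drivers, whichever mode he picks.
        rider_stays = cars == n_agents or bus_payoff(cars) >= car_payoff(cars)
        if driver_stays and rider_stays:
            stable.append(cars)
    return stable


print(nash_car_counts(10, u_car, u_bus))  # prints [7]
```

Passing a different pair of payoff functions to nash_car_counts gives the equilibrium for other payoff schemes.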

When all agents have a single-round memory, we observe that all players 
choose the car in the first round, and thereafter strictly alternate between car 
and bus every round, with every agent making the same choice as all the others. 
Every agent scores an average utility of 35 per round, so that the total 
utility of the population averages 350 per round. 
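
The alternating behaviour can be reproduced with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: m is taken to be the number of other car users, an agent with an empty memory maximises expected utility under a uniform prior over 0..9 other cars, and ties between equally frequent remembered counts are broken towards the most recent observation. The names Commuter and simulate are hypothetical.

```python
from collections import Counter, deque

N_AGENTS = 10


def u_car(m):
    return 100 - 10 * m


def u_bus(m):
    return 60 - 4 * m


class Commuter:
    def __init__(self, memory_length):
        # Remembers the number of other car users in the last ML rounds.
        self.memory = deque(maxlen=memory_length)

    def choose(self):
        if not self.memory:
            # Assumption: uniform prior over 0..N-1 other car users.
            outcomes = range(N_AGENTS)
            exp_car = sum(u_car(m) for m in outcomes) / N_AGENTS
            exp_bus = sum(u_bus(m) for m in outcomes) / N_AGENTS
            return "car" if exp_car > exp_bus else "bus"
        counts = Counter(self.memory)
        best = max(counts.values())
        # Maximum likelihood count; ties go to the most recent observation.
        m_hat = next(m for m in reversed(self.memory) if counts[m] == best)
        return "car" if u_car(m_hat) > u_bus(m_hat) else "bus"

    def observe(self, other_cars):
        self.memory.append(other_cars)


def simulate(memory_lengths, rounds):
    """Return (cars per round, average utility per agent)."""
    agents = [Commuter(ml) for ml in memory_lengths]
    totals = [0.0] * len(agents)
    history = []
    for _ in range(rounds):
        choices = [agent.choose() for agent in agents]
        cars = choices.count("car")
        for i, (agent, choice) in enumerate(zip(agents, choices)):
            others = cars - (1 if choice == "car" else 0)
            totals[i] += u_car(others) if choice == "car" else u_bus(others)
            agent.observe(others)
        history.append(cars)
    return history, [total / rounds for total in totals]


history, averages = simulate([1] * N_AGENTS, rounds=100)
print(history[:4], averages[0])  # prints [10, 0, 10, 0] 35.0
```

With unequal memory lengths the outcome depends on the tie-breaking and prior assumptions above, so this sketch should not be expected to reproduce the later tables exactly.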

In Experiment 2, we individualise memory size, the first agent having 
length 1, the second 2, and so on. We find that the average population utility 
score tends towards 350 as in the previous experiment; however, car and bus 
journeys are differently distributed amongst the agents. In total, car journeys 
represent 70.5% of trips, which is close to the Nash equilibrium described earlier. 

            Experiment 2                                Experiment 3
Agent  Memory  Takes Car  Ave. Utility      Agent  Memory  Takes Car  Ave. Utility
  1      1       0.49       29.78             1      1       0.12       31.06
  2      2       0.96       38.76             2      3       0.87       36.64
  3      3       0.53       30.04             3      5       0.14       31.1
  4      4       0.06       31.62             4      7       0.11       31.22
  5      5       0.53       30.1              5      9       0.96       37.96
  6      6       0.98       38.86             6     11       0.96       37.9
  7      7       0.55       30.5              7     13       0.97       38.1
  8      8       0.98       38.86             8     15       0.98       38.3
  9      9       0.99       39.0              9     17       0.98       38.3
 10     10       0.98       38.8             10     19       0.98       38.3

Table 1. The left table shows the number of times each agent (with different memory 
lengths - Experiment 2) takes the car and his average utility during the experiment. 
The right table shows corresponding results for Experiment 3. 

If we examine the converging behavior of our agents, we can clearly see how 
each agent’s strategy dovetails with the other players. The players who gain the 
highest individual utilities are agents 2,6,9 and 10, who quickly tend to take the 
car almost every journey. However, it is interesting to note that agent 4 switches 
to take the bus every turn, but this strategy is still an improvement on agent 
1, who has the shortest memory length and performs most poorly - he simply 
alternates between taking the car and bus every turn. However, we also see that 
agent 5 switches to this same strict alternating strategy after turn 13, as does 
agent 7 after turn 21. The overall population of agents quickly settle down to an 
alternating pattern of car use after turn 21, where there are 9 car drivers on the 
even turns and 5 car drivers on the odd turns. 

In Experiment 3, we change the memory lengths so that the first agent has 
memory length 1, the second 3, the third 5, and so on, with agent 10 having a memory 
length of 19, so that all agents in the population have odd-valued memory lengths. After 
100 turns, we found the following: 

— Average Total Utility: 358.88 

— Average Car Journeys: 0.707 

— Average Bus Journeys: 0.293 

We can see that the utility in this experiment was slightly higher than previously, 
although the number of overall car journeys was very similar. Results are shown 
in the right half of Table 1. Again we observe that some agents continually 
choose to take the bus, whilst the others choose to always take the car and score 
more highly. If we examine the actual number of cars per turn, we observe an 
equilibrium being reached, although this time stability is obtained at 7 cars after 
31 turns. 

Repeating this experiment with memory lengths 2, 4, 6 etc., so that all 
memory lengths are even-valued, we obtain similar results. 

Initially, it would appear that there is a definite relationship between 
increased memory length and greater utility gain: the agents with longer memories 
tend to gain higher utilities. However, in Experiment 4, we change the pay-off 
functions to 

U(car) = 100-(10*m) and U(bus) = 80-(4*m) 

so that now relatively fewer cars can be in use before taking the bus yields the 
higher utility. 
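
The break-even point under these payoffs can be worked out directly. The derivation below assumes that m counts the other car users seen by the deciding agent, the interpretation that reproduces the threshold of 6 quoted for the original payoffs:

$$100 - 10m > 80 - 4m \iff 6m < 20 \iff m \le 3$$

Under this assumption an agent prefers the car only when at most 3 other agents drive, which corresponds to 4 car users in total at equilibrium. The same calculation for the original payoffs gives $100 - 10m > 60 - 4m \iff m \le 6$, i.e. 7 car users in total.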



            Experiment 4                                Experiment 5
Agent  Memory  Takes Car  Ave. Utility      Agent  Memory  Takes Car  Ave. Utility
  1      3       0.45       55.9              1      2       0.82       67.68
  2      4       0.43       56.56             2      4       0.89       67.74
  3      3       0.45       55.9              3      6       0.92       68.28
  4      4       0.43       56.56             4      8       0.93       68.7
  5      3       0.45       55.9              5     10       0.07       63.06
  6      4       0.43       56.5              6     12       0.1        62.82
  7      3       0.45       55.9              7     14       0.07       62.88
  8      4       0.43       56.56             8     16       0.05       62.88
  9      3       0.45       55.9              9     18       0.05       62.88
 10     50       0.03       63.34            10     20       0.05       62.88

Table 2. The left table shows the number of times each agent (with second payoff 
function - Experiment 4) takes the car and his average utility during the experiment. 
The right table shows corresponding results for Experiment 5. 



The left half of Table 2 shows that again the agent with the greatest memory 
obtains the best utility, this time by recognising that it should take the bus more frequently. 
It should also be noted that this time, the number of cars being used quickly 
converges to a steady 4 cars per turn which is close to the Nash equilibrium. 

However, in Experiment 5, we again use this new payoff scheme and show that 
increased memory length does not always result in greater individual utility 
(right half of Table 2). Again the number of car users in the population quickly 
converges to 4 per turn, but this time, the six agents with the largest memories 
choose to take the bus and score less individually than the four shorter-memory 
agents who choose to take the car. This can be explained if we consider that the 
shorter-memory agents behave ’stupidly’ and take the car because they believe 
it will always give higher utility, whereas the longer-memory agents would like 
to be able to take the car but have the information available to realise that with 
the more selfish agents always selecting the car, they should take the bus. The 
shorter-memory agents score more highly but not as a result of more complex 
reasoning than the other agents. 

3 Experimental Investigation - Additional Information 

In this section, we examine an extended agent model that makes use of an 
additional node which estimates the number of drivers based on an associated 
conditional probability from the previous day’s car drivers. In modelling terms, 
this is different from varying the memory length in that this is a qualitative 
modelling change and not simply a quantitative change. We would expect this 
new agent to be able to take advantage of cyclical behaviour patterns in other 
agents. The conditional probabilities connecting the node representing predicted 
cars and the parent previous count node are again determined by the statistics 
available in the agent’s memory space, with a uniform prior being used where 
no data is available. Initially, we will experiment with only one ’enhanced’ agent 
in the population, the remaining agents following the simple model. 
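
The additional node can be pictured as a table of conditional frequencies estimated from consecutive days in memory. The sketch below is an illustration under assumptions rather than the authors' model: the class name TransitionPredictor and its methods are hypothetical, and the observed quantity is taken to be the number of other car users.

```python
from collections import Counter, deque


class TransitionPredictor:
    """Estimates P(cars today | cars yesterday) from remembered days."""

    def __init__(self, memory_length, n_outcomes):
        self.memory = deque(maxlen=memory_length)
        self.n_outcomes = n_outcomes  # possible counts are 0..n_outcomes-1

    def observe(self, other_cars):
        self.memory.append(other_cars)

    def distribution(self, previous):
        """Conditional distribution over today's count given yesterday's."""
        days = list(self.memory)
        counts = Counter(
            today
            for yesterday, today in zip(days, days[1:])
            if yesterday == previous
        )
        total = sum(counts.values())
        if total == 0:
            # No data for this parent value: fall back to a uniform prior.
            return [1.0 / self.n_outcomes] * self.n_outcomes
        return [counts[m] / total for m in range(self.n_outcomes)]

    def most_likely(self, previous):
        dist = self.distribution(previous)
        return max(range(self.n_outcomes), key=dist.__getitem__)


predictor = TransitionPredictor(memory_length=10, n_outcomes=10)
for seen in [9, 0, 9, 0, 9]:
    predictor.observe(seen)
print(predictor.most_likely(9))  # prints 0
```

An agent that has watched the others alternate between 9 and 0 cars predicts 0 other cars after a day with 9, which is the kind of cyclical pattern the enhanced model is meant to exploit.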

We use our original pay-off scheme 

U(car) = 100-(10*m) and U(bus) = 60-(4*m). Agent 1 will be defined as the 
agent using the enhanced model. 

— Average Total Utility: 261.12 

— Average Car Journeys: 0.708 

— Average Bus Journeys: 0.292 



Agent   Memory  Takes Car  Ave. Utility
1 (E)     10      0.51       28.2
2-10      10      0.73       25.88



Table 3. The single enhanced agent (numbered 1) has greater utility than any of the 
other 9. 



We can see (Table 3) that Agent 1 obtains a higher utility than the others, 
and is able to better select when to take the bus when he predicts that the others 
will be following an alternating pattern and will take the car on that particular 
turn. In fact, when the pattern of car use is studied, we find that Agent 1 is able 
to ascertain that following a day when everyone takes the car, the other agents 
will try to take the car again the following day, and takes the bus for maximum 
benefit to himself. 

We now recreate one of our earlier experiments using differing memory lengths, 
combined with all agents using the improved model to see how the results com- 
pare. 

— Average Total Utility: 358.84 

— Average Car Journeys: 0.709 

— Average Bus Journeys: 0.291 

Comparing Table 4 with the previous experiment with only one modified 
agent (Table 3), we can see that although the average number of car journeys is 
almost identical, the overall total utility for the whole population is much higher. 
This is caused by the agents being able to select the car at different times, due to 
all agents having more varied levels of information. It is interesting to note that 
in this experiment agents 1, 4 and 5 become regular bus users, but in an earlier 
experiment with the same pattern of memory lengths and older model, it was 
agents 1, 2 and 3 who were the bus users. It would appear that the change of 
model causes a different synchronisation of agent behaviour in the two experiments. 

Agent    Memory  Takes Car  Ave. Utility
1 (E)      2       0.16       31.66
2 (E)      4       0.96       37.76
3 (E)      6       0.91       37.36
4 (E)      8       0.10       31.0
5 (E)     10       0.12       31.22
6 (E)     12       0.94       37.72
7 (E)     14       0.96       37.82
8 (E)     16       0.98       38.1
9 (E)     18       0.98       38.1
10 (E)    20       0.98       38.1



Table 4. When all agents are enhanced, the payoff to the population increases. 



4 Conclusion 

Artificial life simulations of social situations are a new area of research which 
brings the advantages of diversity and parallelism in simulations to situations 
which are not readily analysed. In this paper, we have paralleled a recent set 
of experiments using genetic algorithms to evolve cooperation in a population 
of commuters. However, our agents are probabilistic agents who are totally 
self-seeking and yet manage to develop strategies which include accepting lower 
utilities, which favours the population as a whole. 

One interesting finding in this paper is that the most powerful commuters 
(those with greatest memories) do not always gain the greatest utilities; they are 
able to see that there is no individual gain in short sighted greed and so they 
rightly opt for bus-use which helps the rest of the population too. This aspect 
of forced altruism is an area of future research. 

We have also shown that agents with a capacity to predict the number of 
car users in any situation are more capable of gaining greater utilities than the 
first simple population. Such agents have meta-capabilities in that they can 
predict the change in car-use in other road users; in particular such agents can 
identify cyclical use of cars/buses. Again this aspect of meta-information is 
worthy of future research; in particular, the effect of this meta-information on the 
synchronisation seen in groups of agents’ strategies. 

Abstract. This paper proposes a new communication architecture which is 
based on a round-table mechanism. Communication channels are 
preliminarily defined according to the matching of agent requests. A channel 
connects an agent to a queue of matched agents with the same interests and is 
scheduled to become periodically active based on their proportions in total 
demand and the amount of available resources. The order to activate channels 
and the sequence of agents in matched queues are defined based on agent time 
constraints. Our evaluation shows that the proposed model achieves a good 
balance of performance and quality of service compared with the other 
methods and is especially useful when the number of agents is large and the 
capacity of systems is limited. 



1 Introduction 

Agent technology is predicted to be one of the most efficient tools for conducting business 
via the Internet in an automatic, fast, and low-cost way. Softbots are programs which can 
act autonomously to fulfill user tasks. In multi-agent systems, which are based on 
softbots, agents can be distributed on different hosts; they interact and cooperate with 
each other through communication. Thus, agent communication architectures have 
significant influences on system performance and quality of service. 

The development of agent communication systems for agent-based software 
involves: (i) defining formal languages for representing commands and the transferred 
information; (ii) designing communication architectures which include interaction 
mechanisms and communication models; (iii) developing local planning systems for 
each agent which define when and what commands or information should be 
exchanged with other agents to achieve given goals. While there are many systems 
designed for (i) and (iii), (ii) has received less attention and is yet to be fully developed. 

In this paper we concentrate on communication architectures, i.e. (ii). In the next 
section, we discuss issues of multi-agent communication in softbot-based systems. 
Then, a new architecture for agent communication is described in section 3. In order 
to compare the proposed architecture with the others, an evaluation is carried out in 
section 4. Finally, a conclusion is given in section 5. 

2 Communication in Agent-based Softbot Systems 

The general layout of a multi-agent softbot system and its communication management 
is illustrated in Fig. 1. The requirements and goals in developing agent 
communication architectures can be stated as follows: 

Given: n agents A1, A2, ..., An with Q service-request categories C={C1, C2, ..., CQ}. 
Each agent Ai, i=1..n, is characterized by a set of data (Ti, Si, Di), where: 

• Ti is the period of time for which the agent is scheduled to live. 

• Si shows which services the agent is interested in; their deadlines are also given. 

• Di is other data such as sizes of messages, the message box's address, etc. 

Requirements: Design a model and mechanisms for agents {A1, A2, ..., An} to 
exchange messages based on their interests and needs given in Si, i=1..n. The goal is 
to guarantee reliability while maintaining good performance and quality of service 
such as response time, privacy, and customization. 




Fig. 1. Softbot Communication 

Existing architectures for agent communication can be grouped into the following 
categories: Yellow-pages (YP)[1], Contract-Net (CN)[9], Pattern-based (PB)[4], and 
Point-to-point (PP) [2]. A study in [6] shows that most of them either use fixed 
numbers of communication channels [1],[9] or generate channels based on agent 
requests without any control or consideration of system capacity [2], [4]. Thus, 
when the number of agents is large [1][10], PP and PB make the system crash while YP 
and CN give long response times, and all of them suffer from agent starvation. That 
is because these systems use the standard interprocess communication mechanisms of 
low-level middleware or operating systems, which do not consider other information 
about agents, such as their interests, deadlines of requests, or agent lifetime. 

3 Round-Table Architecture 

To overcome these shortcomings, we propose a new communication architecture. 
Our goals are (i) to take into account the limited capacity of the host system and 
agent deadlines; and (ii) to achieve a good balance between the workload of agents 
and the workload of the communication manager. We propose a combination of a 
centralized management unit (CMU) and autonomous management (AM) by each 
agent. In addition, system resources such as memory and CPU time will be divided 
fairly between agents according to the deadlines of requests and agent lifetimes. 

3.1 Communication Model 

Our model is illustrated in Fig. 2. It consists of: (i) Database; (ii) Round Table; 
(iii) Agent Personal Dispatchers (built into each agent). System Data stores agent IDs, 
pointers to message boxes and their status. For security, System Data can be accessed 
only by the CMU and is not available to agents. Agent Data is formed at registration, when 
the agent enters the system, and is accumulated based on the information submitted 
to the CMU during the agent's life. It has the following form: 

• $A_1$: lifetime $T_1$, service interest $S_1 = \{(R_1^{1}, t_1^{1}), \ldots, (R_1^{n_1}, t_1^{n_1})\}$; ... 

• $A_n$: lifetime $T_n$, service interest $S_n = \{(R_n^{1}, t_n^{1}), \ldots, (R_n^{n_n}, t_n^{n_n})\}$. 




Fig. 2. Communication Management Components 



An agent communicates with the others by sending messages. Communication 
management is carried out by the Agent Personal Dispatchers (APD) and the CMU, which 
provide agents with two alternatives for communication (Fig. 3): (i) synchronous; and (ii) 
asynchronous. In asynchronous mode, an agent X sends a message directly to a 
known target agent Y whenever it needs to. This message is stored in the 
receiver Y's message box. In synchronous mode, an agent can use the services of the 
Round-Table mechanism to create a communication channel to a queue of agents who have 
the same interests. Protocols for synchronous communication are described in detail 
in [6]. First, the agent sends a request to the CMU, which contains data about its 
interests. The CMU defines the matched queue for the given request. Then, a seat for 
this agent in the Round Table is defined by its own APD. Next, a permanent 
communication channel is automatically established between the agent and the queue 
and is activated by the rules of the Round Table, which are described in the next 
section. From then on, this agent will send and receive messages synchronously within a 
given period of time defined by the Round Table mechanism. Algorithms for the APD 
and CMU in synchronous mode are described in [6]. 



3.2 Structure of Round Table 

Round Table is a mechanism which matches agents according to their interests and 
then creates communication channels between the matched ones. Unlike other 
communication mechanisms, communication channels in this model are established 
with consideration of agent time constraints. Round Table also controls the number 
of channels based on the available resources: threads and memory. Round Table has 
Q double queues of services and virtually a chain of seats (Fig. 3). The Q queues {R1, 
R2, ..., RQ} are formulated based on agent interests given in Si, i=1..n. In each queue 
Rj, j=1..Q, we have two subqueues: (i) $R^{+}(j)$, a list of agents interested in providing 
the service Cj; and (ii) $R^{-}(j)$, a list of agents interested in demanding the service Cj. 

■ $R^{+}(j) = \{\{A^{1}, t_j^{1}\}, \{A^{2}, t_j^{2}\}, \ldots, \{A^{u(j)}, t_j^{u(j)}\}\}$ 

■ $R^{-}(j) = \{\{A^{1*}, t_j^{1*}\}, \{A^{2*}, t_j^{2*}\}, \ldots, \{A^{u(j)*}, t_j^{u(j)*}\}\}$ 

where $A^{k} \in AS = \{A_1, A_2, \ldots, A_m\}$, k=1..u(j) or k=1..u(j)*; AS is the set of agents 
who use the Round-Table mechanism; $t_j^{k}$ is the time constraint for the given request of 
agent $A^{k}$ concerning service Cj, either in providing or demanding. 







Fig. 3. Round-Table for Agent Communication 



Every agent that wants to use the Round Table for synchronous communication has 
an entry to the Round Table mechanism. There are M entries to the Round Table at 
a given time. Each entry has its queue of requests, which is a list of interests of the 
given agent Ai: $L^{i} = \{L^{i}[1], L^{i}[2], \ldots, L^{i}[H_i]\}$, where Hi is the number of requests of the 
given agent, i=1..M. This list is maintained by the APD and is sorted by the 
deadlines of the requests given in Si. On the other side, assume that Me is the 
maximal number of channels which can be created using the available resources. 
Then the total number of seats in the Round Table is Me. These seats are distributed 
to the agents by the following law: each queue of service Rj, i.e. a sector of the 
Round Table for agents interested in Cj, receives Aj seats, where Aj is defined as follows: 



$$A_j = \frac{M_e \times K_j}{\sum_{i=1}^{Q} K_i}$$



where Kj, j=1..Q, is the number of agents interested in Cj. Thus, the Me channels are 
distributed to the Q queues of services by the following rule: 

$$M_e = \sum_{j=1}^{Q} A_j = \sum_{j=1}^{Q} \frac{M_e \times K_j}{\sum_{i=1}^{Q} K_i}$$
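
The proportional rule above can be sketched in a few lines. The shares it produces are generally not whole numbers, and the text does not say how they are rounded, so this sketch assumes a largest-remainder rounding. That choice and the function name allocate_seats are illustrative assumptions, not part of the proposed architecture.

```python
def allocate_seats(total_seats, interested):
    """Split total_seats among service queues in proportion to the number
    of agents interested in each service (largest-remainder rounding)."""
    total_interest = sum(interested)
    if total_interest == 0:
        return [0] * len(interested)
    # Exact proportional share of each queue: Me * Kj / sum(Ki).
    exact = [total_seats * k / total_interest for k in interested]
    seats = [int(share) for share in exact]
    leftover = total_seats - sum(seats)
    # Hand the remaining seats to the queues with the largest fractions.
    by_fraction = sorted(
        range(len(exact)), key=lambda j: exact[j] - seats[j], reverse=True
    )
    for j in by_fraction[:leftover]:
        seats[j] += 1
    return seats


print(allocate_seats(10, [5, 3, 2]))  # prints [5, 3, 2]
```

Whatever rounding is used, the allocated seats should still sum to the total number of available channels, as the rule requires.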
A request from an agent's list of requests is matched to a queue of service by the local 
APD. On receiving this request from the APD, the central management unit (CMU) checks whether 
there is a free seat in the matched sector, i.e. queue of service. If so, the seat is 
granted and a communication channel is created. If no seat is available, the 
CMU creates a waiting queue W(j) for the given sector, j=1..Q. Requests from W(j) 
gain seats according to the priorities of requests $L^{i}[h]$, h=1..Hi, which are defined by: 

$$P_{L^{i}[h]} = F(T_i, t_i^{h}) - Age(T_i)$$

where F is some function defined by the CMU; $T_i$ is the lifetime of agent Ai; $Age(T_i)$ is an 
aging function which increases the priority of an agent with the time the agent has been in 
the system (we use this technique to avoid agent starvation); and $t_i^{h}$ is the deadline of 
request $L^{i}[h]$ of agent Ai. 
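
The effect of the aging term can be illustrated with a toy priority function. The text leaves F and Age unspecified, so the definitions below are purely hypothetical: F is taken to be the smaller of the agent's lifetime and the request deadline, Age grows linearly with waiting time, and a lower value is served first. The names priority and next_request are likewise invented for this sketch.

```python
def priority(lifetime, deadline, waited, aging_rate=1.0):
    """Lower value means served earlier (assumed convention)."""
    urgency = min(lifetime, deadline)     # hypothetical choice of F
    return urgency - aging_rate * waited  # hypothetical choice of Age


def next_request(waiting, now):
    """Pick the waiting request that should receive the next free seat."""
    return min(
        waiting,
        key=lambda r: priority(r["lifetime"], r["deadline"], now - r["arrived"]),
    )


waiting = [
    {"agent": "A3", "lifetime": 100, "deadline": 40, "arrived": 0},
    {"agent": "A4", "lifetime": 100, "deadline": 10, "arrived": 100},
]
# A3 has the looser deadline but has waited 100 time units, so it goes first.
print(next_request(waiting, now=100)["agent"])  # prints A3
```

Without the aging term, A4's tighter deadline would always win and A3 could wait indefinitely, which is the starvation the mechanism is meant to prevent.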

For each queue of service Rj, j=1..Q, the Aj channels are given to the agents who have 
the shortest lifetimes. The order of potential target agents in a subqueue $R^{+}(j)$ or $R^{-}(j)$ is 
defined by their priorities as follows: 

$$P_{TA^{p}} = G(T_p, t_j^{p})$$

where G is a function; $T_p$ is the lifetime of agent $A^{p}$; $t_j^{p}$ is the deadline of agent $A^{p}$'s 
interest in Cj; p = 1..u(j) or 1..u(j)*. More protocols and algorithms of the Round Table 
mechanism are described in [6]. 

4 Comparative Evaluation 

In order to estimate the performance of the proposed model we use the following 
criteria: (i) Cost of EC: time complexity spent for establishing communication 
network, usually for matching agents and filtering messages; (ii) Maximal Number of 
Channels: the possible highest number of channels in the agent communication 
system at a time; (iii) Density: the maximal average number of channels to/from an 
agent. Performance characteristics of PP, PB, CN, and YP methods, described in [7], 
and of the new architecture are shown in Table 1. 

Table 1. Performance Characteristics 



Methods          Cost of EC   Maximal Number of Channels   Density
Point-to-Point   0            (n-1) × n/2                  (n-1) → m*
Pattern-Based    Q × n        Q × n                        Q → m*
Contract-Net     0            n                            1
Yellow-Pages     Q × n        n                            1
Round-Table      Q × n        Me                           (Q/2+1) → m*



(m* is the number of agents which matched the requests of a given agent) 



We use a set of fuzzy values {VL,L,M,H,VH}, standing for {Very Low, Low, Medium, 
High, Very High}, to measure quality of service on the following criteria: Agent 
workload; Agent response time; Privacy; Customization. A comparison of the PP, PB, 
CN, YP, and RT architectures is shown in Fig. 4. Note that in our architecture an 
agent can have both synchronous and asynchronous communication simultaneously. 
This gives agents more freedom and flexibility. Asynchronous communication is 
good for one-to-one negotiation, while synchronous communication with the Round 
Table mechanism would be suitable for surveys. 



[Figure: chart with legend entries Point-to-point, Pattern-based, Contract Net, Yellow pages, and the proposed architecture; categories Workload, Response, Privacy, Customization] 

Fig. 4. Fuzzy Comparison of PP, PB, CN, YP in terms of Quality of Service 



5 Conclusion 

We have proposed a new architecture for agent communication which considers 
system and time constraints and is able to scale itself to adapt to limitations, 
including changes in system capacity. Thus, this architecture would be especially 
useful in large agent-based systems or in systems running on hosts with limited 
resources. Our analysis and evaluation show that it also achieves a good balance of 
system performance and quality of service. In the future we intend to embed the 
given architecture into an e-business system for mobile services which is proposed in 
[5] by VTT Electronics of Finland. 

Abstract. Mobile agents are autonomous objects that can migrate from 
one node to another in a computer network. Due to failures of communication 
nodes, mobile agents may be blocked or crashed even if there 
are other nodes available that could continue processing. To solve this problem, we 
propose a scheme with path reordering and backward recovery to 
guarantee the migration of mobile agents in networks. 



1 Introduction 

Mobile agents are autonomous objects that can migrate from node to node of a 
computer network and provide services to users by executing on the 
databases or computational resources of hosts connected by the network. To migrate, 
a mobile agent needs a virtual place, the so-called mobile agent system, 
to support mobility [1]. Many prototypes of mobile agent systems have been 
proposed, such as Odyssey [2], Aglet [3], AgentTCL, Mole [5], and so forth. 
However, most systems rarely guarantee migration 
when a communication node fails or a host crashes during 
the tour after a mobile agent is launched. That is, when there are faults such 
as the destruction of nodes or of mobile agent systems, mobile agents may 
become blocked or orphaned even if other nodes are available 
that could continue processing. Because of the autonomy of mobile agents, there is no 
natural instance that monitors the progress of agent execution. 

2 Previous Mobile Agent For Migration 

Mobile agents migrate autonomously according to the relevant routing 
schedule and then accomplish their goals. Figure 1 depicts how a node repository 
can be used for implementation instead of a transaction message queue for agents. 
Assume that an agent moves from a node to the consecutive node along the path 
N1, N2, ..., N(k-1), Nk (where Ni is a network node, Hi is a host, and Ri is an agent 
repository). As an agent may visit the same node several times, Ni and Nj (1<=i, 
j<=k) may denote the same or different nodes. Assume further that an agent 
is stored in a repository when it is accepted by the agent system for execution. 
Except for Nk, each node performs the following sequence of operations in 
transaction Ti: Get(agent); Execute(agent); Put(agent); Commit. Get 
removes an agent from the node’s repository. Execute runs the received 
agent locally. Put places it in the repository of the host that will be visited 
next. These three operations are performed within a transaction and hence 
constitute an atomic unit of work. 
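
The all-or-nothing character of one hop can be sketched with in-memory repositories. This is only an illustration under assumptions: the Node class and run_hop function are hypothetical, repositories are plain lists, and execute is assumed to return a new agent state rather than modify the agent in place, so that a failed hop can be undone by putting the original agent back.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.repository = []  # agents waiting to be executed at this node


def run_hop(current, following, execute):
    """Perform Get; Execute; Put; Commit as one atomic step.

    Returns True if the hop committed, False if it was rolled back.
    """
    agent = current.repository.pop(0)            # Get
    try:
        new_state = execute(agent)               # Execute
        following.repository.append(new_state)   # Put
    except Exception:
        # Abort: restore the agent so the hop can be retried later.
        current.repository.insert(0, agent)
        return False
    return True                                  # Commit


n1, n2 = Node("N1"), Node("N2")
n1.repository.append({"log": []})
committed = run_hop(n1, n2, lambda agent: {"log": agent["log"] + ["N1"]})
print(committed, n2.repository)  # prints True [{'log': ['N1']}]
```

If execute raises an exception, the agent stays in the current node's repository and nothing reaches the next node, which is the behaviour the transaction is meant to guarantee.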




[Figure: transactions T1, T2, ..., Tj, ..., Tk, each executing the agent at a successive node] 

Fig. 1. A migration path of a mobile agent. 



In Fig. 1, assume that a failure happens at a particular node Ni within the 
migration path of the mobile agent. Even though the host Hi of node 
Ni is alive, the agent cannot migrate. Conversely, even though node Ni can 
communicate with the previous node N(i-1), the agent sometimes cannot migrate 
because the host Hi is not running the agent system. In these cases, the 
agent never arrives at the last node Nk, and the agent at the previous host H(i-1) has 
to wait for the user’s intervention. In the worst case, if a host shared by multiple 
launched agents crashes while executing (launching) them, the agents will be blocked 
or destroyed. 

3 Proposed Scheme 

We describe a scheme for the agent system to support reliable migration of 
mobile agents even if some hosts in the cluster of computer networks fail. 
The scheme adapts to the ’fault types’ that prevent agents from 
continuing their migration. 

3.1 Reordering of The Whole Path 

A mobile agent cannot migrate to its destination node when that node is faulty or 
its host has crashed. Fig. 2(a) supposes a migration path corresponding 
to an agent’s routing schedule that contains some faulty nodes, such as N3, N4, and 
N7. The agent migrates and executes from node N1 to N2 in sequence, but it is 
blocked at the host of node N2 until node N3 is recovered. If node N3 
does not recover, the agent may be orphaned or destroyed by that host. 
The solution is for the agent to skip the faulty node N3 on the migration path 
and move the address of node N3 to the end of the 
migration path. Hence, node N2 successfully connects to the next node 
N4 without any fault. As the same method is applied to the other nodes, the 
agent’s migration path is reordered. This solution changes the previous migration 
path by connecting the normal nodes first and deferring the nodes that have a 
fault. Afterwards, the agent retries each faulty node 
after waiting for the time-stamp assigned by the mobile agent system. If the 
faulty node recovers within the time-stamp, the agent migrates to it 
successfully. Otherwise, the address of the faulty node is discarded. Fig. 2(b) 
shows how the whole migration path of the mobile agent is changed by this scheme. 








Fig. 2. Faults of nodes on a migration path and reordering 

Path Reordering is executed when the agent connects to communicate with the mobile agent 
system. If the agent cannot connect to the destination node, it proceeds by 
connecting to the next node, after the failed address has been moved to the end 
of the routing table so that the connection to that node can be retried later. When it 
reconnects to a failed destination address, it waits for the time-stamp 
assigned by the mobile agent system before connecting. If it fails again, it 
ignores this address and goes on to the next faulty node. If the agent 
kept retrying a node that has failed more than twice, it could 
fall into an endless loop of connection attempts, so the number of retries is limited. If a 
connection is made but the host of the node fails to run the mobile agent system, the case is 
handled in the same way. In this way, Algorithm 1 automatically reorders the migration path. 

Algorithm 1. Path Reordering

For each agent’s routing-table {
    extract a target address and fail_checked information;
    if (no more target addresses) backward multicast ’Agent_Fire’
        signal to successful_target nodes;
    if (it is a fail_checked_address) {
        wait the agent during some system_timestamp;
        try to connect Socket to the address;
        if (success) { call goAgent;
            exit;
        } else { notify to user the address is unavailable;
            ignore the address;
        } }
    else if (not a fail_checked_address) {
        try to connect Socket to the destination node;
        if (success) { call goAgent;
            exit;
        } else { notify to user;
            move the current failed_address to last in the routing-table;
            set the fail_checked information;
        } } }
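The reordering logic of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the JAMAS implementation: the function name `reorder_path` and the `can_connect` probe are hypothetical, the socket connection is replaced by that probe, and the wait for the system time-stamp before a retry is only marked by a comment.

```python
from typing import Callable, List, Tuple


def reorder_path(route: List[str],
                 can_connect: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Visit the route in order, deferring each failed address exactly once.

    Returns (visited, discarded). `can_connect` is a hypothetical probe that
    stands in for the socket connection attempt of Algorithm 1.
    """
    # Each routing-table entry carries its fail_checked flag.
    table = [(address, False) for address in route]
    visited: List[str] = []
    discarded: List[str] = []
    index = 0
    while index < len(table):
        address, fail_checked = table[index]
        # A real system would wait for the system time-stamp here
        # before retrying an address whose fail_checked flag is set.
        if can_connect(address):
            visited.append(address)        # corresponds to "call goAgent"
        elif fail_checked:
            discarded.append(address)      # second failure: ignore the address
        else:
            table.append((address, True))  # first failure: move to the end
        index += 1
    return visited, discarded
```

For a route N1 to N7 in which N3 and N7 stay down and N4 recovers before its retry, the sketch visits N1, N2, N5, N6 and then N4, and discards N3 and N7.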

3.2 Backward Recovery 

In Fig. 3, we suppose that migrated agents execute autonomously at the host
H5. If the host H5 of node N5 crashes, all agents at that host are blocked or
destroyed. To prevent this, when an agent migrates after finishing its job at
a host, it leaves a clone of itself at that host. The clone then waits for an
acknowledgement signal ’ACK’ from the next host. If the ’ACK’ signal does not
arrive from the next host H5 within the time-stamp, the clone waiting at the
host H4 is automatically activated to work around the failure. It then passes
over the node N5 and hops to the next node N6. If the migrated agent fails
during execution at the host H6, the same method is repeated. Moreover, if the
running agent at the host H5 is destroyed by a crash and, at the same time,
the host of the prior node N4 also fails, the clone already left at the
earlier host H3 wakes up and re-runs. This is the so-called Backward Recovery.







Fig. 3. An example of Backward Recovery 



Backward Recovery works as follows. The agent system leaves a clone of the
agent at every host it has already passed, from the source to the current
node, and each clone waits for its own time-stamp. The time-stamp of each
clone is largest at the source; the next one is smaller, reflecting the
migration and execution time at the prior host, and so forth. Once an agent is
launched, its time-stamp accumulates, and the agent informs every clone at the
prior hosts of its own moving and running time before it departs for the
current host. Therefore, clones are



348 D. Lee, B. Jeon, and Y. Kim 



waiting during the time-stamp. Each clone spontaneously revives and redoes
the path reordering, regarding the next host as crashed, if it has not
received any signal from that host. At the last node’s host, the agent system
broadcasts an 'Agent-Fire' signal to all the copies of the agent left along
the path to the destination, except those at faulty nodes and failed hosts.

Algorithm 2. A Backward Recovery

Waiting Clones Check {
    for each sleeped_Clone
        if (empty a Clone_timestamp) notify to user;
        call wakeupClone; }
goAgent {
    send the agent;
    wait the agent’s ’ACK’ signal during send_timestamp;
    if (’ACK’) { clones the agent;
        call sleepAgent; }
    else call wakeupAgent;
}
arriveAgent { send ’ACK’ to the previous_node;
    execute the agent;
}
sleepAgent {
    for each cloned_agent {
        add agent_timestamp to system_timestamp;
        add the agent to the sleeped_list;
        sleep the agent;
    }
}
wakeupClone {
    for the sleeped_list find a cloned_agent;
    remove it from the sleeped_list;
    if (’Agent-Fire’) remove the cloned_agent;
    else {
        move the current failed_address to last in the routing-table;
        set the fail_checked information;
        call arrangePath at the algorithm 1;
    }
}
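The choice of which clone revives can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `waking_clone` is hypothetical, time-stamps and ’ACK’ signals are not modelled, and the set of crashed hosts is assumed to be known in advance, whereas in the scheme each clone discovers a failure by its own time-out.

```python
from typing import List, Optional, Set


def waking_clone(path: List[str], failed_index: int,
                 crashed: Set[str]) -> Optional[str]:
    """Return the nearest earlier host whose clone can still wake up.

    `path` lists the hosts in visiting order and `failed_index` is the
    position of the host where the running agent was lost. Walking backward
    mirrors the text: if the immediately preceding host has crashed as well,
    the clone one step further back takes over.
    """
    for index in range(failed_index - 1, -1, -1):
        if path[index] not in crashed:
            return path[index]
    return None  # no surviving clone, the agent cannot be recovered
```

With the hosts H1 to H6 and a crash at H5, the clone at H4 wakes up; if H4 has crashed too, the sketch returns H3, as in the example of Fig. 3.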



4 An Implementation 

Our scheme is implemented in the JAva Mobile Agent System (JAMAS) that we
developed. As shown in Fig. 4, JAMAS consists of a Graphic User Interface, an
Agents Mobile Service component, an Agents Execution Environment component,
and an Agents Repository that provides naming transparency for agents. In
addition, more than one system may run within a single host.

Fig. 4. The architecture of JAMAS




We experiment with an agent that manages some NEs (network elements). The
following figures show how the sample agent, acting as a MIB (Management
Information Base) browser, migrates and executes according to the routing
schedule. Fig. 5 depicts the routing path of the sample agent, namely NEh,
NEb, NEa, NEc, and we assume a fault at the host NEb. The network manager
fetches the prepared agent and specifies the routing addresses of its
migration. When the manager clicks the ’Go’ button on the manager’s window to
launch it, the agent starts a tour to get the MIB information of each NE on
behalf of the network manager.

Fig. 6 shows screen shots of the results of the mobile agent. The agent tracer
GUI shows which nodes are faulty and how the agent continues to migrate in the
network. The agent executed at the host with IP address 172.16.53.21 of the
first node NEh tries to migrate to the second node NEb. Because of a failure
there, the agent hops over it and migrates to the third node NEa. On
completing the execution at the last node NEc, it reports the result of
reconnecting to the faulty node NEb on the reordered path. Finally, Fig. 7
shows the execution of the agent at each NE. Fig. 7(a), a screen capture of
the host NEh, shows the hop caused by a connection failure at the next node
NEb after the launched agent has progressed normally. That is, because the
host has failed, the agent passes on to the next one. Fig. 7(b) and (c)
capture the execution of the agent at the hosts NEa and NEc, where our scheme
is applied. Thus the agent tours all the nodes that have no faults before it
reconnects to the faulty nodes.




Fig. 6. Agent-Tracer GUI 




(a) A screen shot of executing at the NEh

(b) A screen shot of executing at the NEa

(c) A screen shot of executing at the NEc and attempting a second migration
to the NEb

Fig. 7. Fault-tolerable execution of a mobile agent at each NE



5 Conclusions 

We have discussed a fault-tolerable scheme that uses path reordering and
backward recovery to ensure the migration of mobile agents in networks. The
proposed scheme not only makes it possible to avoid faults of communication
nodes or hosts of mobile agents, but also affects the agents’ life span.

Abstract. Ontology is an essential element of an agent system. With it,
agents can share their knowledge and communicate with each other. As agent
systems are applied more widely, the importance of ontology is increasing.
Although there have been some approaches to constructing ontologies, they
have fallen far short of practical needs. In this paper we construct an
Ontology Server, which provides an ontology adapted to electronic commerce
(EC), and apply it to a comparative shopping system.



1 Introduction

Ontology[5,6,7] is essential for the agent system[4,5]. An ontology represents
an explicit specification of knowledge. Ontology is crucial for communication
and interoperation, not only among agents but also between user and system.
Although there have been some approaches to the construction of
ontology[8,9,10,11], they were too far from being applicable to a real field.
Their ontologies were too general and independent of any specific domain, so
they described only very abstract concepts. Therefore we propose some
characteristics that an ontology adapted to EC[1,2,3] should have.

- Ontology can be translated. In EC, there are many shopping sites. To
  communicate and to carry out its role, an agent must be able to translate
  its knowledge into another ontology. So we decided to construct a standard
  ontology, which can be translated into local terms. Of course, the inverse
  translation is also possible.

- Ontology should be practical. In EC, the level of detail of the ontology is
  very important. If the ontology presents only abstract concepts, an agent
  cannot perform its part exactly. Conversely, if its description is too
  detailed, it is hard to use efficiently in practice.



2 Ontology feature

Our goal of an ontology adapted to EC requires the ontology to have some
particular features.



2.1 Domain specific 

At first we tried to build a general ontology, independent of any domain. But
generality hinders full expression, and such an ontology cannot satisfy
practical needs. We want the ontology to be powerful enough to be used in a
real field, so we decided to make our ontology domain dependent.



2.2 Ontology type 

We classify ontology by type, for ease of use. The field of application
changes slightly with the type. Types are divided along two axes. One is the
time of use, which divides into analysis time and search time. At search time,
ontology is mainly used to build the interface through which user and agent
communicate. At analysis time, the agent gathers data and analyzes it. Of
course, some ontology is used at both times. The other axis is the manner of
use. There are mainly two input types on the Web: the subjective input type,
such as a text field, and the selective input type, such as a combo box.
Fig. 1 depicts the type classification and distribution. Some ontology also
lies across the axis of the time of use.





Fig. 1. Two axes for classifying ontology 



2.3 Ontology relation 

There are many synonyms on the Web, and it is hard for an agent to understand
their meaning. For sites to communicate with each other, translation is
necessary. We concluded that a standard ontology should be constructed to
facilitate translation, because it is better to build a central point of
connection than to give each term the ability to change into every other
term. We designate this connection as a relation. The relation determines the
power of performance and expression, so its strategy must be chosen carefully.
The most important factor in the strategy is the set of values included in a
selective-type ontology. Because the diversity of values is extreme, making a
relation between them is a serious problem, and the values have an n:n
relation. On the other hand, within the same domain the ontologies are
similar, so they can easily have a 1:1 relation. Fig. 2 shows a case in which
there is a relation between site A and site B, and the relation of values is
more complex than that of the ontology.
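The advantage of a central standard ontology over pairwise translation can be illustrated with a small Python sketch. It is not the Ontology Server itself: the class `StandardOntology`, its methods and the example terms are hypothetical, and only 1:1 relations between terms are handled, not the n:n relations between values.

```python
from typing import Dict, Tuple


class StandardOntology:
    """Hub that relates each site's local terms to one standard term."""

    def __init__(self) -> None:
        self._to_standard: Dict[Tuple[str, str], str] = {}
        self._to_local: Dict[Tuple[str, str], str] = {}

    def relate(self, site: str, local_term: str, standard_term: str) -> None:
        # One relation per site and term; no site-to-site table is needed.
        self._to_standard[(site, local_term)] = standard_term
        self._to_local[(site, standard_term)] = local_term

    def translate(self, term: str, source: str, target: str) -> str:
        # Local term -> standard term -> local term of the target site.
        standard = self._to_standard[(source, term)]
        return self._to_local[(target, standard)]
```

With this design, n sites need n sets of relations to the standard ontology instead of a translator for each of the n(n-1) ordered pairs of sites.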



[Figure: Site A (ontology, ontology, selective ontology) related to Site B
(ontology, ontology, selective ontology)]



Fig. 2. The relation of ontology between Site A and Site B 



3 Ontology Server 

A standard ontology is built in the Ontology Server. It must be based on
actual Web sites in order to be applicable in the real field and to be useful
and practical, so the standard ontology needs objective and concrete
properties. The Ontology Server provides a manager with an editor. The
detailed explanation follows.

1 Gathering from Web 

First of all, we need the local terms used in each site. The standard ontology
can be built based on them. Gathering Agents are in charge of this process.
They collect local terms as well as other information, and classify the
ontology type. Once this process is done, all information is stored in the
database.



2 Making a relation 

As mentioned previously, making a relation is not only an important job but
also a substantial and challenging problem, as in many other ontology
projects. The Ontology Server provides an editor, which browses the stored
information and makes a relation.



3 Modifying or rebuilding the standard ontology 

While making a relation, a need to modify or rebuild the standard ontology
may arise, because the standard ontology may have some faults or a new
ontology may appear.



4 Servicing the standard ontology

After all the processes are done, the standard ontology is served to other
agents, and the translator is automatically generated for translation.



4 Implementation



4.1 System architecture 

The whole system (Fig. 3) is developed in Java, and MySQL is used as the
database. The Ontology Server, on a Linux machine, constructs the standard
ontology and serves it. Fig. 4 shows an editor with two panels: the current
standard ontology is displayed on the left, and the local terms are presented
on the right. With it, a manager builds a relation. All information is stored
in the Ontology Server. The User Agent executes a search at the user’s
request. The Gathering Agent, residing on the server side, gathers the
relevant information from the Web and analyzes it. Fig. 5 presents the
Gathering Agent analyzing a game site.




Fig. 4. An example of editor in Ontology Server



Fig. 5. An example of Gathering Agent’s view



4.2 Search process 

The search process is similar to traditional real-time comparative
shopping[2,3], but the interface changes dynamically with the user’s choice.
A user therefore has many search functions, such as selection, whereas an
ordinary system provides only a keyword search. A user can not only search
more conveniently and precisely but also get richer results. Because the
description of each site’s product attributes is stored in the ontology
server, a user agent can analyze the search result with it, while a
traditional system shows only a minimal result such as name and cost. Once
the user’s choice is determined, the User Agent converts it, with the
translator, into the local form fitted to each site. Conversely, the result
from each site is transformed into the standard ontology, and the User Agent
shows this result to the user, who can read it more easily. This process is
described in Fig. 6.




Fig. 6. The execution of User Agent 



5 Result and Related Work

We have shown that an ontology adapted to EC can be applied usefully. The
user interface changes dynamically as the domain changes, so the search can
be more precise, and with the information in the ontology server the result
has more attributes than that of an ordinary comparative shopping system. A
user gains more benefit and spends less time and effort searching. While our
system shows that the performance and application of ontology in EC are
successful, a number of limitations were also revealed. The ontology relation
and the standard ontology need hand-coding, a chronic problem shared with
other ontology projects, and the result may lack objectivity. Moreover, the
WWW is not agent-friendly, so the Gathering Agent has trouble in the
analysis.

Abstract. In this paper, we propose, design and implement a cyber banking
process and settlement system for Internet commerce. The proposed system
employs the concept of OPOI (One Process One Input), the basic concept of
the Korean BankERP System we have already developed. The system can be used
for all kinds of transactions, such as B2B, B2C and C2C. We have applied the
system to real-world transactions through an alliance with a major bank in
Korea, and confirmed its effectiveness.



1. Introduction 

With the explosive proliferation of the Internet, E-Commerce (EC) has drawn
the attention of most companies and customers as either an infrastructure or
a business model for all kinds of business transactions. In fact, EC brings
many economic benefits to both buyer and seller[1]. In spite of these,
however, EC has some critical problems, such as the complexity of the payment
settlement system. One of the most important elements of Internet e-business
is the cyber banking process and settlement system. The existing systems
require the sellers to bear system management and network costs, which
increases the indirect cost. Because the comparatively high cost of
processing and settlement is imposed on the product/service price, the
existing systems are not suitable for small-amount transactions and are
applied to only a limited range of transactions[1]. The well-known existing
systems, which share the problems mentioned above, are as follows:
Digicash[3, HRER 1], Cyber Cash [HRER 2], Mondex [HRER 3], Enipay
[HRER 4], Netcheck [HRER 5], e-check [HRER 6], SENB [HRER 7],
TeleBank [HRER 8], CYBank [HRER 9], egg [HRER 10], fleetBoston
[HRER 11] and so on.

The Cyber Banking Process and Settlement System suggested in this paper is
currently adopted as a Korean BankERP System and a banking system model. The
proposed system acts as a broker between seller and purchaser, offering
convenience, safety and reliability in use. It can therefore effectively
assist all kinds of transactions, such as B2C, B2B and C2C. It benefits the
seller by reducing the direct and indirect costs of system management,
network, O&M (Operation and Maintenance), etc.[5,8]. Its low commission rate
encourages small-amount transactions. It allows the purchaser to access the
system simply, without a card or additional equipment. It is the first system
designed and implemented in Korea for cyber banking processing and settlement
in Internet e-business.

In Chapter 2, a Cyber Banking Process and Settlement System is suggested.
Chapter 3 describes the system architecture and functions. Chapter 4
describes implementation and evaluation. Chapter 5 is the conclusion.



2. Business Process of Cyber Banking Process and Settlement 
System 

The internal integration business process and the settlement business process
of the system are defined as follows:



2.1 Internal Integration Business Process: OPOI 

The BankERP applied in this system has the OPOI (One Process One Input)
concept, in which a process is handled by one input of account data and
management data. A bank has various processes: traditional deposit, loan and
credit, import/export, fund, head office/branch, etc. The bank processes may
be divided into two categories: account process and due-date process. The
account process is based on debits and credits to the account, and the
due-date process covers time deposits, installments, loans, import and
export, funds, etc. These process types are shown in Figure 1.

A Type: account process
B Type: time deposit, loans, funds
C Type: import, export, foreign remittance
D Type: installments
E Type: trust
X Type: other processes, settlement of



Fig. 1. Knowledge Map of the OPOI 

(1) An A type process is transferable to A, B, C, D, E and X type
processes.

(2) B, C, D and E type processes are not transferable to each other but
only to the A type.

(3) An X type process is transferable to the A and X types.
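The three transfer rules can be written as a small check, sketched here in Python. The function name `transferable` is hypothetical, and rule (2) is read strictly, so a B, C, D or E type process is assumed to transfer only to the A type and not to its own type.

```python
DUE_DATE_TYPES = {"B", "C", "D", "E"}
ALL_TYPES = {"A", "X"} | DUE_DATE_TYPES


def transferable(source: str, target: str) -> bool:
    """Return True if an OPOI process of type `source` may transfer to `target`."""
    if source not in ALL_TYPES or target not in ALL_TYPES:
        raise ValueError("unknown process type")
    if source == "A":
        return True                # rule (1): A transfers to every type
    if source in DUE_DATE_TYPES:
        return target == "A"       # rule (2): B, C, D, E transfer only to A
    return target in {"A", "X"}    # rule (3): X transfers to A and X
```

Under this reading, every transfer between two due-date processes passes through an account process, which is what lets a single input drive the whole chain.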

2.2 Business Process of Cyber Banking Process and Settlement System

The system consists of four parties: sellers (shopping malls and others),
purchasers, strategic alliance banks, and the system owner who operates the
system. Each party’s function is as below:

(1) Operating Owner of Cyber Banking Process and Settlement System: the
owner establishes the servers, DB and network necessary for system
operation. He opens cyber accounts for sellers and purchasers upon their
request/subscription and processes payment and settlement. He provides
firewall, cryptography and insurance to secure the safety and reliability of
the system, so that the system has a security solution. In strategic
alliance with traditional banks, he prepares a reservation fund account in
the owner’s name at the strategic alliance bank, which supports prompt and
safe cash in and out for users. The owner and the strategic alliance bank
enter into a fund management contract in order to manage safely the money
paid into the reservation fund account. For each purchase and sale, a
trading commission is imposed on the seller.

(2) Seller: the seller opens an account in the system to receive payment
from purchasers. The seller may request a debit from the account after a
reasonable period of time.

(3) Purchaser: the purchaser opens an account in the system for the purpose
of payment and credits a cash amount to the account via an existing
traditional bank (account transfer, home banking, internet banking, etc.).
When he credits the account, he informs the owner. When he needs to debit,
the money is transferred upon his request from the cyber account to his real
account at the traditional bank.

(4) Strategic alliance bank: the bank opens the reservation fund account of
the system under its contract with the owner, assists the money in and out
of sellers and purchasers in real time, and controls the owner’s debits at
the front office of the bank.

The above four parties’ process flows are as follows (see Figure 2):

① Seller and purchaser open cyber accounts.

② The purchaser credits cash to the reservation fund account at the alliance
bank and notifies the owner of the credit. The alliance bank informs the
owner in real time of the debit and credit transactions on the reservation
fund account.

③ The purchaser buys from and/or subscribes to the seller’s site.

④ The purchaser approves the payment for the product/service at the owner’s
site.

⑤ The owner transfers the payment from the purchaser’s account to the
seller’s account, and notifies the seller and confirms the transaction.

⑥ The seller delivers the product/service to the purchaser.

⑦ The seller requests payment to his designated bank account.

⑧ The purchaser requests a debit of his amount in the cyber account to his
designated bank account.

⑨ The owner submits a payment order to the alliance bank to transfer money
from the reservation fund account to the requesting seller’s or purchaser’s
account.




364 M.-S. Kim and E.-S. Lee 



⑩ The alliance bank transfers to the designated bank account of the seller
or purchaser upon the owner’s payment order.

⑪ The alliance bank notifies the owner of the daily transaction details of
the reservation fund account. The owner checks daily that the balance of the
reservation fund account corresponds to the total of all accounts at the
cyber bank.
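The flow above, and in particular the daily correspondence check of the last step, can be sketched as a toy ledger in Python. This is an illustration only: the class `CyberBank`, its method names and the 1% commission rate are hypothetical, and the sketch assumes that the owner’s commission stays in the reservation fund account until it is withdrawn.

```python
from decimal import Decimal
from typing import Dict


class CyberBank:
    """Toy ledger: cyber accounts backed by one reservation fund account."""

    def __init__(self, commission_rate: Decimal = Decimal("0.01")) -> None:
        self.commission_rate = commission_rate  # hypothetical rate
        self.accounts: Dict[str, Decimal] = {}
        self.owner_fees = Decimal("0")
        self.reserve = Decimal("0")             # balance at the alliance bank

    def open_account(self, name: str) -> None:
        self.accounts[name] = Decimal("0")

    def deposit(self, name: str, amount: Decimal) -> None:
        # Cash credited at the alliance bank is mirrored in the cyber account.
        self.reserve += amount
        self.accounts[name] += amount

    def pay(self, purchaser: str, seller: str, amount: Decimal) -> None:
        # Payment moves between cyber accounts; the seller bears the commission.
        if self.accounts[purchaser] < amount:
            raise ValueError("insufficient cyber balance")
        fee = amount * self.commission_rate
        self.accounts[purchaser] -= amount
        self.accounts[seller] += amount - fee
        self.owner_fees += fee

    def withdraw(self, name: str, amount: Decimal) -> None:
        # A payment order moves money out of the reservation fund account.
        if self.accounts[name] < amount:
            raise ValueError("insufficient cyber balance")
        self.accounts[name] -= amount
        self.reserve -= amount

    def reconciled(self) -> bool:
        # Daily check: reservation fund equals all cyber balances plus fees.
        total = sum(self.accounts.values(), Decimal("0")) + self.owner_fees
        return total == self.reserve
```

Because a payment only moves money between cyber accounts, the reservation fund changes only on deposits and withdrawals, which keeps the daily check a simple comparison of two totals.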




Fig. 2. Cyber Banking Process and Settlement System 



3. System Architecture 

The system architecture (see Figure 3), a 3-tier client/server type[10],
consists of a WEB server (see Figure 4), an Application server and a DB
server (see Figure 5). The WEB Server consists of a WWW Server (Apache
Server), WEB Pages, a Security Certification Server, a WEB Brokerage
Application Server and a Service Distribution Gateway. The Application
Server consists of an account transfer system and an account information
system. The DB Server consists of a Data DB, a Memory DB and a History DB.
Tier 1 is accessed by users via web browsers (Netscape, Explorer) on the
Internet. Tier 2 has the WEB Servers, WEB Gateway, WEB Service Broker,
Firewall and Security Certification, which support the user’s safe use, and
the Application Server, which runs the financial application programs.
Tier 3 is the RDBMS that stores all user data in the database. The languages
are JAVA, UNIX, CCGl and HTML; the operating systems are UNIX, LINUX and
Windows 98; the database is ORACLE RDBMS; and the security system is the
128-bit SSL[6,11, HRER 12] model. The network protocol is TCP/IP for
internal server telecommunication and X.25 for alliance bank
telecommunication.

Fig. 4. Web service Architecture



Fig. 5. Web Trading Architecture 



4. Implementation of the Business Process of Cyber Banking Process and
Settlement System and Evaluation



4.1 Implementation 



The process has been implemented by applying the BankERP System with which
this paper is mainly concerned. The application was designed using Power
Designer. The major processes of the system are (1) cyber account inquiry,
(2) money transfer from bank account to cyber account, (3) transfer from
cyber account to bank account, (4) transfer from cyber account to cyber
account, and (5) payment at the seller’s site. The details are as follows:
(1) Cyber account inquiry checks the account information, transaction
details and balance periodically. (2) Money transfer from bank account to
cyber account puts cash into the cyber account. (3) Money transfer from
cyber account to bank account pays out the seller’s proceeds and pays back
the user’s cash. (4) Transfer from cyber account to cyber account is for a
user’s money transfer within the cyber bank. (5) At the seller’s site, the
purchaser makes a one-click payment from the cyber account after a purchase.



4.2 Evaluation 

The Cyber Banking Process and Settlement System suggested in this paper
supports convenient, safe, prompt and reliable banking settlement in
Internet EC. Since the system is at the beginning stage of commercial
practice, it is not easy to evaluate its effectiveness empirically, but it
can be evaluated by comparison with other existing systems. The other
existing systems are based on real accounts at the bank accessed via
Internet tools, and are limited in service to simple credit, deposit, loan
and transfer. They are not adequate as settlement systems for EC: they do
not allow real-time monitoring of money in and out of the account, and they
impose a high commission rate on small-amount transactions. The credit card
system is exposed to the risk of card and personal data disclosure and of
malicious use by others. It also imposes a high commission rate on its
users, and a card is inconvenient to carry and can be lost. In contrast, the
Cyber Banking Process and Settlement System uses a cyber account and makes
settlement ready with one click on the Internet. It helps the seller and
purchaser to trade directly, and its commission rate is comparatively lower
than the others’. It is very easy for sellers and purchasers to set up. It
has the fewest problems. Apart from the credit card system, no other system
can handle deferred payment, but this system can, like the credit card
system. No other system allows payback of the balance in the account, but
this system allows payback at any time the user wants. The system needs no
additional equipment or facility, only a Web browser to access the Internet,
which lessens the user’s investment and maintenance cost and makes it more
economical than other systems. The low commission rate encourages
transactions whether the amount is large or small. Since the system began
its service in a strategic alliance with a major bank in Korea, its
effectiveness has been demonstrated in practice. This is the first system to
apply the Korean BankERP, whose effectiveness may thereby also be
demonstrated. So far its service has been quite satisfactory.



5. Conclusion 

This paper suggests, designs and implements the first Korean business
process model of a Cyber Banking Process and Settlement System, which has
been in commercial operation and has increased its usability through an
alliance with a major bank in Korea. The system gives many benefits to its
users. For the seller, it requires no additional equipment or facility,
charges a low commission rate, and settles safely and promptly to the
seller’s cyber account at the time of trading. For the purchaser, it
supports easy access with only a web browser, safe and prompt payment, and a
guarantee by the system against any possible loss in procurement. For the
alliance bank, it operates the reservation fund, which gives a certain
profit to the bank, and it warrants reliability to other users. We are
confident that the system will serve as a model of cyber settlement systems
and contribute to the active expansion of EC in the near future. The next
mission is to establish the first cyber bank, upgrading the system by
applying the Korean BankERP in combination with off-line and on-line banks.

Abstract. This paper proposes a shopping agent with a robust inductive
learning method that automatically constructs wrappers for semi-structured
online stores. Strong biases assumed in many existing systems are weakened
so that the real stores with reasonably complex document structures can be
handled. Our method treats a logical line as a basic unit, and recognizes
the position and the structure of product descriptions by finding the most
frequent pattern from the sequence of logical line information in output
HTML pages. This method is capable of analyzing product descriptions that
comprise multiple logical lines, and even those with extra or missing
attributes. Experimental tests on over 60 sites show that it successfully
constructs correct wrappers for most real stores.



1 Introduction 

A shopping agent is a mediator system that extracts the product descriptions 
from several online stores on a user’s behalf. Since the stores are heterogeneous, 
a procedure for extracting the content of a particular information source called 
a wrapper must be built and maintained for each store. A wrapper generally
consists of a set of extraction rules and the code to apply those rules [5].

In some systems such as TSIMMIS[4] and ARANEUS[2], extraction rules for 
the wrapper are written by humans. Wrapper induction[5] has been suggested to 
automatically build the wrapper through learning from a set of resource’s sample 
pages. However, most previous systems were unable to cover many real stores 
since they relied on some strong biases, imposing too many restrictions on the
structure of documents that can be analyzed. For example, ShopBot[3] assumes
that product descriptions reside on a single line, and HLRT[5] cannot handle
the cases with noise such as missing attributes. The STALKER[6] algorithm deals
with the missing items or out-of-order items, but it is not fully automatic in the 
sense that users need to be involved in the preparation of training examples. 
ARIADNE[1] is a semi-automatic wrapper generation system, but its power of 
automatic wrapper learning is limited since heuristics are obtained mainly from 
the users rather than through learning. 

In this paper, we propose a shopping agent with a simple but robust inductive
learning method that automatically constructs wrappers for semi-structured
online stores. Strong biases that have been assumed in many systems are
weakened so that real-world stores can be handled. Product descriptions may
comprise multiple logical lines and may have extra or missing attributes. Our
method treats a logical line as a basic unit, and assigns a category to each
logical line. The HTML page of a product search result is converted into a
sequence of logical line categories. The main idea of our wrapper learning is
to recognize the position and the structure of product descriptions by finding
the most frequent pattern containing the price information. This pattern is
regarded as the extraction rule of the wrapper.

2 Overview of Comparison Shopping Agent 

Our wrapper learning method is implemented in a prototype comparison shop- 
ping agent called MORPHEUS. The overall architecture of MORPHEUS is 
shown in Fig. 1. It consists of several modules including the wrapper genera- 
tor, the wrapper interpreter, and the uniform output generator. 




Fig. 1. The overall architecture of MORPHEUS 



The wrapper generator is the main learning module that constructs a wrapper 
for each store. In fact, the wrapper generator learns two things. First, it learns 
how to query a particular store by recognizing its query scheme. An HTML page 
containing a searchable input box is analyzed and a query template is generated. 
Second, it learns how to extract a store’s content. Product descriptions in the 
store’s search result pages are recognized and their repeating pattern is deter- 
mined. The wrapper interpreter is a module that executes learned wrappers to 
get the current product information. This module forms several actual queries by 
combining each store’s query template with the keywords that the user actually 
typed in, and sends them to the corresponding shopper sites. The search results 
from the stores are then collected and fed to the uniform output generator mod- 
ule. The uniform output generator integrates search results from several stores 
and generates a uniform output. 

3 Learning Wrappers for Online Stores 

One key function of the wrapper generator is to learn the format of product 
descriptions in result pages from successful searches. Each page contains one or 
more product descriptions that matched the sample query. A product description 
is composed of a sequence of items that describe the attributes of the product. 
For example, a bookstore displays search results in which the attributes of a 
product include the book title, the author, the price, and/or readers' reviews. 

Wrapper learning has to find the starting and ending position of the list of 
product descriptions in the entire result page, and to recognize the pattern of a 
product description. To do this, our method is divided into three phases. 

In the first phase, the HTML source of the page is broken down into logical 
lines. A logical line is conceptually similar to a line that the user sees in the 
browser, so the algorithm recognizes each logical line by examining HTML’s 
delimiter tags such as <br>, <p>, <dd>, <hr>, <table>, <td>, and <tr>. 

The second phase of the algorithm is to categorize each logical line and as- 
sign it the corresponding category number. Currently, we maintain 5 categories 
including TEXT, PRICE, LTAG, TITLE, and TTAG, and their category num- 
bers are 0, 1, 2, 3, and 8, respectively. Here, TITLE denotes the product name, 
PRICE denotes the price, TTAG denotes table tags such as <tr>, LTAG denotes 
the HTML tags other than TTAG used in logical line breaking, and TEXT de- 
notes a general string that is not recognizable as one of the above four categories. 
We use simple heuristics for this category assignment. For example, TITLE is 
assigned to a logical line when the line contains one of the keywords in the 
sample query, and PRICE is assigned by recognizing a dollar sign $ (or some 
other symbol that represents the price unit) and a digit. Fig. 2 shows the HTML 
source of a product description that is obtained from the Amazon bookstore by 
the query “Korea”, along with assigned category numbers for logical lines. 
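
The first two phases can be sketched in a few lines of Python. This is a minimal sketch, not the MORPHEUS implementation: the function names, the regular expressions, the split of delimiter tags into table tags (`table`, `td`, `tr`) and other line-breaking tags, and the choice to test for PRICE before TITLE are all assumptions made for illustration. Only the category numbers and the two heuristics (query keyword for TITLE, dollar sign plus digit for PRICE) come from the text.

```python
import re

# Category numbers as given in the text.
TEXT, PRICE, LTAG, TITLE, TTAG = 0, 1, 2, 3, 8

# Assumption: <table>, <td>, <tr> count as table tags (TTAG); the other
# delimiter tags named in the text count as LTAG.
TABLE_TAGS = {"table", "td", "tr"}

_DELIM = re.compile(r"(</?(?:br|p|dd|hr|table|td|tr)\b[^>]*>)", re.IGNORECASE)
_TAG_NAME = re.compile(r"</?\s*([a-zA-Z]+)")
_ANY_TAG = re.compile(r"<[^>]*>")
_PRICE = re.compile(r"\$\s*\d")


def logical_lines(html):
    """Phase 1: split the page at delimiter tags; each delimiter is its own line."""
    return [part.strip() for part in _DELIM.split(html) if part.strip()]


def categorize(line, keywords):
    """Phase 2: assign one category number to a logical line."""
    if _DELIM.fullmatch(line):
        name = _TAG_NAME.match(line).group(1).lower()
        return TTAG if name in TABLE_TAGS else LTAG
    visible = _ANY_TAG.sub(" ", line)  # ignore markup such as <font> or <a>
    if _PRICE.search(visible):
        return PRICE
    if any(k.lower() in visible.lower() for k in keywords):
        return TITLE
    return TEXT


def to_category_string(html, keywords):
    """Express a page as a string of category digits."""
    return "".join(str(categorize(line, keywords)) for line in logical_lines(html))
```

Because every category number is a single digit, the whole page can be held as a plain string, which keeps the later pattern search simple.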



<a href='/exec/obidos/ASIN/3540618724/qid=95863                            3
<br>                                                                       2
by K. Kim(Editor), Tsutomu Matsumoto (Editor). Paperback (November 1996)   0
</td>                                                                      8
</tr>                                                                      8
<tr>                                                                       8
<td valign=top width=50%>                                                  8
<font face=verdana,arial,helvetica size=-1> Our Price:$73.95               1


Fig. 2. An HTML source for a book and the categories for logical lines 



After the categorization phase, the entire page can be expressed by a sequence 
of category numbers. The third phase of our algorithm then finds a repeating 
pattern in this sequence. It first finds the pattern of each product description 
unit (PDU) and counts the frequency of each distinct pattern to get the most 
frequent one. The next candidate PDU is found by searching for PRICE 
first and then backtracking in the sequence to search for TITLE, even though it is 
generally assumed that the TITLE attribute appears before the PRICE attribute 
in a PDU. This is because the heuristics for recognizing PRICE are more 
reliable than the heuristics for recognizing TITLE. The 
subsequence of logical lines from TITLE to PRICE becomes the resulting 
pattern of a PDU. Pseudocode for this algorithm is given below. 

seq ← the input sequence of logical line categories; 
seqStart ← 1;       /* initial position for pattern search */ 
numCandPDUs ← 0;    /* number of distinct PDU patterns */ 
while (true) { 
    /* find the next candidate PDU */ 
    priceIndex ← findIndex(seq, seqStart, PRICE); 
    titleIndex ← findIndexReverse(seq, priceIndex, TITLE); 
    currentPDU ← substring(seq, titleIndex, priceIndex); 
    if (currentPDU == NULL) then exit the while loop;    /* no more PDUs */ 
    if (currentPDU is already stored in candPDUs array) 
        then the frequency count of currentPDU is incremented by 1; 
    else { save currentPDU in candPDUs array; 
           increment numCandPDUs by 1; }    /* a new PDU pattern */ 
    seqStart ← priceIndex+1;    /* starting point for searching next PDU */ 
} 
mostFreqPDU ← the element of candPDUs array with maximum frequency; 
return(mostFreqPDU); 
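
A runnable reading of this pseudocode is sketched below in Python, assuming the category sequence is held as a string of single-digit category numbers. Two points are assumptions rather than part of the original algorithm: the backward search for TITLE is bounded by the end of the previous PDU, and a PRICE with no preceding TITLE is skipped instead of ending the loop. The `anchor` and `start` parameters are hypothetical names that let a caller anchor the search on a category other than PRICE.

```python
from collections import Counter

PRICE, TITLE = "1", "3"


def most_frequent_pdu(seq, anchor=PRICE, start=TITLE):
    """Return (pattern, count) for the most frequent start..anchor run, or None."""
    counts = Counter()
    pos = 0
    while True:
        price_idx = seq.find(anchor, pos)
        if price_idx == -1:
            break  # no more candidate PDUs
        # Backtrack from the anchor, but not past the previous PDU (assumption).
        title_idx = seq.rfind(start, pos, price_idx)
        if title_idx != -1:
            counts[seq[title_idx:price_idx + 1]] += 1
        pos = price_idx + 1  # starting point for the next search
    if not counts:
        return None
    return counts.most_common(1)[0]
```

For instance, a sequence made of some header categories, three repetitions of `32088881` and one shorter description `3201` yields `32088881` with a count of 3.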

For Amazon, the learned PDU pattern is 32088881, as shown in Fig. 2. In the 
shopping stage, the wrapper interpreter module applies the learned PDU pattern 
to modify noisy PDUs with different attributes by ignoring extra attributes or 
putting dummy values for missing attributes. 

4 Experimental Results 

We implemented MORPHEUS and built a Web interface as shown in Fig. 3(a) 
so that the user can select a store that is to be learned. Learned stores are added 
to the store list from which the comparison shopping can be done. 

To evaluate the performance, we have tested MORPHEUS for 62 real online 
stores as to whether correct wrappers can be generated. We assume that a proper 
test query is given in the learning phase so that the output page with reasonably 
many matched products can be produced. In order to verify whether the correct 
wrapper is generated, the result of wrapper learning is displayed as in Fig. 3(b). 
In this display, the learned PDU pattern along with the product names and their 
corresponding prices are shown. If this data is consistent with the data 
obtained by directly accessing the store site, the wrapper is regarded as 
correctly generated. 

Fig. 3. MORPHEUS Interfaces 

Table 1 shows the test data for some of the 62 sites that have been tested. 
During the test, we collected relevant information for each output page 
of a site, such as the test query used in learning, the learned PDU pattern, the 
number of PDUs, and the number of MF-PDUs (most frequent PDUs). In this 
experiment, the proposed wrapper generation algorithm worked satisfactorily, 
succeeding for 58 of the 62 stores. A few sites such as www.dsports.com failed 
to yield a PDU pattern because their output pages contain unnecessary product 
information in the header. 



Table 1. Experiment data during wrapper generation for some of 62 sites 



Store URL                 test query  PDU pattern               No. of      No. of 
                                                                total PDUs  MF-PDUs 

www.more.com              gift        388088088088121             7           7 
www.jewelryweb.com        ring        32022228880881888808821    18          18 
www.softwarebuyLine.com   school      320808020808088081         40          29 
www.lcache.com            video       32088021                   10          10 
www.etronixs.com          video       32088021                   10          10 
www.egghead.com           Compaq      38080881                   44          36 
www.bookbay.com           java        3202021                    17          17 
intertain.com             java        321                       133         133 
www.more.com              gift        388088088088121             7           7 
www.amazon.com            java        32088881                   50          26 

5 Discussion and Future Work 

We have developed a robust method for automatic wrapper generation in the do- 
main of comparison shopping, and the test results have shown that it successfully 
constructs correct wrappers for most real stores. 

The characteristics of our method in comparison to previous research are 
summarized as follows. First, the strong biases assumed in many existing sys- 
tems are weakened so that the real stores with reasonably complex document 
structures can be handled. Second, we do not exploit the domain knowledge. 
This makes the learning algorithm simple and domain independent, and it still 
works satisfactorily. Third, learning in MORPHEUS is processed quickly since it 
does not incorporate a separate module for removing redundant fragments such 
as the header, tail, and advertisements. 

There are also some limitations in our current system. First, we have as- 
sumed that a proper keyword is given for the test query by humans. Heuristics 
for providing a proper test query automatically should be investigated. Second, 
each product description must contain the price attribute. We think that this is 
not a severe restriction since most stores that produce semi-structured product 
information include the price attribute, with only a few exceptions. Nonetheless, 
this restriction may reduce the generality of the algorithm since it cannot be 
applied to other domains that do not require price information. One solution 
might be to specify the feature attribute that must exist in a product description 
as a parameter to the algorithm, rather than hard-coding it in the program. 
Third, we only extract the price information from a product description 
that may contain several other attributes. Extracting non-price information by 
exploiting proper domain knowledge (or an ontology) is in progress. 

Abstract. This paper presents new services for intelligently monitoring and visualising 
user accesses to a university’s web site. These are based on the use of data mining 
techniques to process data recorded in the web log files and visualise it. The system is 
to be used by the University of Nottingham's web server to monitor the interests of 
potential students and to predict the student numbers and their geographical 
distributions for the next academic year. The system could be adapted for other 
academic and commercial web sites to assist intelligent decision making. 

1. Introduction 

This paper reports research that applies data mining techniques to the 
data collected from web access log files to discover subtle relationships and patterns 
that assist decision making. Recently there have been many successful applications of 
data mining techniques to industrial and business domains. However, little has been reported 
on data mining applications in academic institutions at the time of writing. 

The research aims at applying data mining techniques in the context of academic 
institutions. At the centre of the research is a real-time web traffic analysis and 
monitoring system for the University of Nottingham’s web server. The research seeks 
to transform the web into an environment where users are aware of the presence of 
each other and the web server is aware of the presence and the interests of its users, so 
as to monitor and control user accesses more effectively. The system provides graphical 
and textual representations of web access by users from all over the world. Our work 
has been motivated by the need of universities to know the interests of their users and 
the access demands on their web servers at different times round the clock, so as to 
provide better services for the users. Monitoring web access not only allows a 
university to know its potential students but also to know the strengths and limitations 
of its server structure. It is important for universities to know who their potential 
students might be and what interests them about the university. 

The University of Nottingham excels in its teaching and research as one of the leading 
academic institutions in the UK. The number of potential applicants to the university, 
home or overseas, is ever increasing. More and more potential applicants are using the 
University’s web site to get information about courses and other matters. However at 
the moment the only way to estimate users’ interests in the University is by counting 
the number of hits of the University’s web pages. This can be misleading as 
undoubtedly the hits could have been due to some casual users exploring the web. The 
access counters alone cannot tell what users' interests are. 

K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 374-379, 2000. 

© Springer-Verlag Berlin Heidelberg 2000 




Real-Time Web Data Mining and Visualisation 375 



To analyse user access we start with the web server log files. Though these server log 
files are rich in information, the data is itself usually in plain text format with comma 
delimiters, abbreviated and cryptic. It is difficult and time-consuming to make sense 
of it. The volume of information is also overwhelming: a 1MB log file typically 
contains 4000 to 5000 URL requests, along with the IP address of the request, the date 
and time of the request, and the encoding language of the users’ browser. 

2. Real-time Web Access Data Visualisation 

The first step is to transfer all the entries of the web log file into a SQL database. No 
information was lost in this process. Data mining programs are then applied to the 
database to visualise the relevant data and extract its meaning. 
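
The transfer step can be illustrated with a small Python sketch that uses the standard `sqlite3` module. The table name `access_log`, the column names and their order are assumptions: the paper states only that an entry is comma-delimited and carries the URL request, the IP address, the date and time, and the browser's encoding language. Entries with an unexpected number of fields are skipped here for brevity, which a lossless loader would have to handle differently.

```python
import csv
import io
import sqlite3

# Assumed column order; the paper only lists which fields an entry carries.
FIELDS = ("url", "ip", "requested_at", "language")


def load_log(conn, log_text):
    """Copy comma-delimited log entries into an access_log table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS access_log "
        "(url TEXT, ip TEXT, requested_at TEXT, language TEXT)"
    )
    rows = [
        tuple(field.strip() for field in record)
        for record in csv.reader(io.StringIO(log_text))
        if len(record) == len(FIELDS)  # simplification: skip malformed entries
    ]
    conn.executemany("INSERT INTO access_log VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def hits_per_url(conn):
    """Number of requests per URL, most requested first."""
    return conn.execute(
        "SELECT url, COUNT(*) AS hits FROM access_log "
        "GROUP BY url ORDER BY hits DESC, url"
    ).fetchall()
```

Once the entries are in a table, page counts and similar summaries reduce to ordinary SQL queries such as the one in `hits_per_url`.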

2.1. Data Visualisation Against Map Images 

Users are classified according to their geographical distributions. Because the users 
may not have registered with the University, their distributions have to be estimated 
from the difference between the users’ local time and the time of the server. The users 
are then placed into appropriate time zones and displayed against a GIS map, as 
shown in Figure 1 below. The database is updated each time a new access request is 
received by the web server and the screen is refreshed with the new entry added to the 
appropriate time zone. For registered users of the web site, their countries and their 
exact locations in the country are known, so they can be added to the total numbers in 
those locations. 
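
The time-zone estimate described above amounts to simple clock arithmetic, sketched here in Python. The function names, the rounding to whole hours and the wrap into the range -12 to +11 are assumptions made for illustration; real time zones also include half-hour offsets and offsets beyond +12, which this sketch ignores.

```python
from datetime import datetime


def estimate_utc_offset(user_local, server_local, server_utc_offset=0):
    """Estimate a user's UTC offset in whole hours from the clock difference."""
    diff_hours = round((user_local - server_local).total_seconds() / 3600)
    offset = server_utc_offset + diff_hours
    # Wrap into -12..+11 so that differences across the date line stay valid.
    return (offset + 12) % 24 - 12


def bucket_by_zone(accesses, server_utc_offset=0):
    """Count accesses per estimated zone; accesses are (user_local, server_local) pairs."""
    counts = {}
    for user_local, server_local in accesses:
        zone = estimate_utc_offset(user_local, server_local, server_utc_offset)
        counts[zone] = counts.get(zone, 0) + 1
    return counts
```

The resulting dictionary maps each estimated zone to a hit count, which is the quantity a map display of this kind would plot per zone.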




Figure 1. User distributions 



2.2 Data Visualisation Using Dynamic Mapping 

The visualisation part of the system was also implemented using ArcXML to create 
dynamic mapping. It is no trivial task to create dynamic mapping of real time data 
onto a GIS map and automatically display it on the web. The usual way of doing this 
is to hard code the hyperlinks on a map image. Web browsers only support HTML 
and XML documents; they cannot display GIS layers. 

We use ArcXML and ASP to create dynamic mapping of data onto a GIS map and 
automatically display the map on the web, see Figure 2. 

Figure 2. Dynamic mapping for hits display 

First we developed a simple renderer (a dynamic chart) using ArcXML. It reads ASP 
variables to provide the framework for filling polygons, drawing lines, displaying 
points, symbols and text labels. Examples of symbols and text labels are shown in 
Figure 3. 

Figure 3. Display layers 

GIS software can superimpose different layers on top of one another and display them 
all on a map. The problem is that ArcXML only allows one symbol layer or one label 
layer to be mapped onto a renderer. Also, web browsers can display only one renderer at 
a given position on a page. This is the reason why the system uses ASP via ODBC to 
retrieve data from the SQL database for the renderer, rather than using ArcXML to 
call the data directly. For example, suppose there are two data items: ‘Glasgow’ for 
creating a label and the number of hits, 4464, for creating the symbol (a pink 
coloured circle). A renderer can read only one column of data from the database. If it 
reads ‘Glasgow’, it will not be able to read the number of hits. Using an ASP variable 
with the value ‘Glasgow4464’, the system creates a display in which both the label 
and symbol layouts appear on the same renderer. 
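
The single-column workaround described above can be mimicked in a few lines of Python. The system itself does this with ASP variables; the helper names `pack` and `unpack` are hypothetical, and the sketch assumes that a place name never ends in a digit, since otherwise the combined value could not be split unambiguously.

```python
import re


def pack(label, hits):
    """Combine a label and a hit count into one value, e.g. 'Glasgow4464'."""
    return f"{label}{hits}"


def unpack(value):
    """Split a combined value back into (label, hits); assumes the label has no trailing digit."""
    match = re.fullmatch(r"(.*?)(\d+)", value)
    if match is None:
        raise ValueError(f"no trailing hit count in {value!r}")
    return match.group(1), int(match.group(2))
```

The label part drives the text layer and the numeric part drives the symbol size, so one column of data is enough to populate both.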

2.3 Visualising Failed Access Attempts 

Apart from displaying successful access of a web site, those waiting or failed attempts 
to access the web can also be displayed by making use of the information stored in the 
error log file. Figure 4 shows textually the successful and failed accesses to a few web 
pages of the web site over a period of 60 minutes. Each of the three digit numbers in 
the columns represents up to 10 users currently using the web site. Some web servers 
and databases have a limit on the number of accesses allowed at the same time. Page 
usage analysis is useful for identifying user interest and for improving the site design: 
pages attracting no traffic may be removed and some resources may be clustered to 
improve network traffic. 

Figure 4. Page accesses 



2.4 Visualising Web Server Usage 

Similarly server traffic and the access behaviour of each user can be displayed and 
rules concerning user interests can be derived based on his/her access display, see 
Figure 5. Users in each geographical location are further classified into different 
subject categories such as education, MBA, IT sectors etc., to assist user interest 

Figure 5. Page hit account and the top 4 course web pages 



We broke the web server log report down into two categories: information for web 
administrators and information for the administrators of the University or its 
departments. Web administrators need information such as what pages are most often 
accessed and what links are followed and where the problems lie. The administrators 
of the University are more interested in attracting users with the web contents. 



3. Data Mining 

Data mining techniques can be categorised into three categories: classification, 
association, and sequence. Extracting association rules from data allows us to see the 
relationships hidden in the data. For example, the presence of some IP addresses in 
the log file may imply the presence of other IP addresses. The role of classification in 
data mining is to develop rules to group data together based on certain common 
features. 

3.1 Classification. The available data for this are the IP addresses found in the log 
database. Classification rules classify IP addresses into two groups. Let $I_2$ be the set of 
IP addresses belonging to the University and $I_1$ the set of IP addresses which do not belong 
to the University. 

$I_1$ = Non-university IP address 

$I_2$ = University IP address (128.243.00.00) 

Also, the languages of users’ browsers could be categorised into different groups. Let 
$L$ be a set of languages: $L_1$ = UK English, $L_2$ = US English and $L_{3..n}$ = other 
languages. 
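
These two classifications are easy to express in code. The Python sketch below is illustrative only: reading "128.243.00.00" as the network 128.243.0.0/16 is an assumption, as are the function names, the group labels and the language codes (`gb-en` is the form used in this paper, while `en-gb` and `en-us` are the usual browser codes).

```python
import ipaddress

# Assumption: the University address "128.243.00.00" denotes the /16 network.
UNIVERSITY_NET = ipaddress.ip_network("128.243.0.0/16")


def ip_group(ip):
    """Classify an IP address as 'university' or 'non-university'."""
    return "university" if ipaddress.ip_address(ip) in UNIVERSITY_NET else "non-university"


def language_group(code):
    """Classify a browser language code as UK English, US English or other."""
    code = code.strip().lower()
    if code in ("en-gb", "gb-en"):
        return "uk-english"
    if code == "en-us":
        return "us-english"
    return "other"
```

Applied to every row of the log database, the two functions give each access a pair of group labels on which rules can then be stated.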

3.2 Association. This involves rules that associate one attribute of a relation with 
another. For example, an association rule could be of type ($I_1$) AND ($L_1$) $\Rightarrow$ (URL = 
ENGLISH.HTML). This rule associates the IP address and the language of the user’s 
browser with the type of courses he/she is interested in. Other example associations are 

IF $U(i_1, l_n)$ = Overseas users $\Rightarrow$ Course site including English support (225.225.12.33) 

IF $U(i_1, l_1)$ = Local users $\Rightarrow$ Course site without English support (199.127.0.10) 

In Table 1 below there is a user whose IP address is 225.225.12.33, whose browser 
encoding language was Spanish and whose time difference was 1 hour. This means that 
the user is from another country and wanted to know the details of our MBA courses 
and English support. The user whose IP address is 199.127.0.10, whose browser 
encoding language is gb-en (UK English) and who has no time difference is a local user, and 
he/she is more interested in the services of the University and the course modules. 



Table 1. $U(i, l)$ Log File Database 

Group of IP addresses    URL 

128.243.233.23           Modules, Society 
225.225.12.33            English, MBA, CSIT 
199.127.0.10             Modules, Library, Sports, IT support 

* U is a set of URLs, i is a group of IP addresses and l is a group of browser encoding languages 



We know that certain users will visit a page and will not continue traversing the 
hyperlinks contained in that page. For example, IT courses are as popular as MBA 
courses, but MBA fees are much higher than IT courses, so over 70% of the users 
were directed to IT courses instead. All these analyses provide an insight into the 
behaviour of users and the usage of the site’s resources. 

3.3 Creating Rules. Previous work on quantifying the “usefulness” or “interestingness” of 
a rule focused on how much the support of a rule exceeds what the supports of 
the antecedent and the consequent of the rule would predict. We implemented Piatetsky-Shapiro’s 
idea that a rule $X \Rightarrow Y$ is not interesting if support($X \wedge Y$) $\approx$ support($X$) $\times$ support($Y$). 
Consider the following rule: 



$U(i_1, l_n) \Rightarrow$ English Support 

If “Non-University’s IP addresses” is a parent of “English Support”, and over 15% of 
users of “Non-University’s IP addresses”, we would expect the rule $U(i_1, l_n) \Rightarrow$ English 
Support to have 2% support and 70% confidence. If the actual support and confidence are around 2% 
and 70%, respectively, the rule can be considered redundant since it does not convey 
any additional information and is less general than the first rule. 
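
The support, confidence and Piatetsky-Shapiro test mentioned in this subsection can be sketched in Python as follows. Transactions are modelled as sets of attribute labels; the function names and the `tolerance` threshold that decides when two values count as approximately equal are assumptions, since the paper does not state how close the values must be.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, x, y):
    """Confidence of the rule X => Y, i.e. support(X and Y) / support(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


def is_interesting(transactions, x, y, tolerance=0.01):
    """Piatetsky-Shapiro test: X => Y is uninteresting when
    support(X and Y) is approximately support(X) * support(Y)."""
    joint = support(transactions, set(x) | set(y))
    expected = support(transactions, x) * support(transactions, y)
    return abs(joint - expected) > tolerance
```

A rule whose joint support matches the product of the individual supports describes attributes that occur together only as often as chance would predict, so it conveys nothing new.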

The problem of discovering generalised association rules can be decomposed into 
four steps as shown below. Let $I$ be a set of IP addresses, $I_1$ the University IP address, 
$I_2$ the set of other IP addresses. 

• Determine $L_1$ (language) and $I$ from the log database with a minimum support a. 
  This is to find UK visitors regardless of whether they are current students of the 
  University or not. 

• Generate the candidate 2-item set $C_2$ from $I_1$ and $I_2$. 
  This is to create two child nodes, one for $I_1$ with $L_1$, the other for $I_2$ with $L_1$. 

• Let $F$ be a set of frequent URLs, which are not used to generate $C_2$. Scan all of the 
  database tables to decide the real $L_1$ and obtain $L_2$ from $C_2$. If a transaction 
  belongs to $F$, then it is filtered out for the information of any 2-item set in $L_2$. 

• Perform the remaining steps in the same way to find $L_n$ for $n \geq 2$. 
  This determines the locations of visitors whose language is in $L_2$, which involves 
  comparing time differences. Repeat to create the next two nodes ($C_2$). 

Abstract. A flexible constraint-posting architecture for Web site filtering 
from the field of knowledge representation has been transferred to 
the domain of Web constrained browsing. The architecture defines an 
interpreter that accepts declarative constraint formulae which it uses to 
filter urls to which declarative actions are applied. 



1 Introduction 

A walker is a computer program that systematically browses the Web, building 
indexes as it follows every link it can find. Walkers can read about a thousand 
pages per second, and many of them read every word on every page. The walker 
then feeds the information it has gathered into a searchable database. Probably 
the two most useful of these types of search engines are Alta Vista and HotBot. 
These are both great for “needle-in-a-haystack” type searches - very specific 
information like exact quotes, phrases or names of people and places. 

In Mallery, 1996, a Web walker for the Hypertext Transfer Protocol (HTTP) 
was implemented using a control architecture. W4 uses a declarative and ex- 
tensible vocabulary of constraints to characterize traversals of Web structures. 
Starting from a root resource, the walker recursively follows all hyperlinks whose 
associated resource satisfies the constraints guiding the walk. As the walker tra- 
verses the structure it performs operations that are specified in a declarative and 
extensible action vocabulary. 

The purpose of this paper is to describe how, for research and pedagogical 
reasons, we reengineered and reused the W4 filtering walker technology to constrain 
users’ browsing for a local experimental educational Web site. Sections 2 
and 3 explain the logical and software framework. Section 4 gives a detailed 
description of the filtering mechanism. 



2 CL-HTTP 

A Web site is made accessible to users through Internet access and 
software called a Web server. Because the distance supervision problem 
requires intelligent and flexible manipulation of data, and AI-based 
techniques are needed to solve it, the Common Lisp HTTP server (aka 
CL-HTTP (Mallery, 1994)) was the prime choice rather than other standard 
non-programmable HTTP servers. This tool was tailored to meet user requirements 
and to create modules capable of performing the following: 

— detect the trace of students’ learning process, 

— identify the students’ profile and their needs in terms of pedagogical mate- 
rials, 

— propose and sometimes impose to these students, the training mode that 
meets their profile and courses which will be most adequate for them, and 

— evaluate the students’ learning process and predict their achievement. 

These objectives are met according to the functional system, which controls 
user’s Web navigation independently of the course design stage. 

3 Constrained Web Walking 

3.1 W4 

As the World Wide Web has grown, Web walkers have settled into two general 
applications: site maintenance and high-volume indexing. In these roles, the 
walkers have been tuned for specific activities that are applied uniformly over 
Web regions. 

The W4 constraint-guided Web walker is a second generation Web walker 
intended for traversing well-specified regions of the World Wide Web and per- 
forming any variety of actions. Control of the walk is specified with an extensible 
vocabulary of constraints that limit enumeration of Web resources. Actions ap- 
plied to each accepted resource are specified by an extensible action vocabulary. 
Conditional branching in constraints and actions makes possible adaptive re- 
sponses to Web topology. Most importantly, constraint and action abstractions 
enforce a separation of control from action as they encourage reuse of control 
and action abstractions. W4 extends the abstractions of this server and basic 
client to accessing Web resources and walking Web structures. 

3.2 Using W4 as a Template 

The initial hypothesis we made was that using traditional teaching methods 
through the Web would save time and effort, and could itself provide a novel 
and innovative delivery medium. This statement was far too simplistic. Time 
and space resources were certainly saved; however students seemed uncomfort- 
able assuming a bigger part of the responsibility for their course success/failure. 
Indeed, in most instances, teachers and students are miles apart during most of 
the instructional process and never meet face-to-face at all. As a result, students 
become responsible for both the course content and the process by which they 
acquire and manage this content without supervision. 

With respect to these considerations and based on the filtering model of W4, 
we developed a system that is powerful enough to structure and intelligently 
provide pedagogical material to the students according to their profile and their 
history. This system, called WebGuide, is actually a prototype including intelligent 
tutoring methods that allow teachers to organize and constrain navigation 
of the supervised Web site by the students. Teachers and authors use this tool to 
manage and monitor students’ learning process. Students choose transparently 
the training mode that best meets their profile. 

4 WebGuide: Constrained Browsing 

4.1 Exporting URLs 

Writing Common Lisp functions that compute responses to incoming HTTP 
requests is the main feature of the CL-HTTP. Response functions compute a reply 
to the HTTP methods get or post. Before returning HTML to the user, they 
must arrange for an appropriate status code and appropriate headers to be 
returned. A response function becomes accessible only after an associated url 
is exported with export-url. An example exported url follows: 

(export-url #u"/cl-http/cem/index.html" 
            :html-computed 
            :authentication-realm :minimum-security 
            :expiration '(:no-expiration-header)) 



4.2 Scanning and Mapping Web Site 

The first step in the process of filtering navigation on a Web site is to build a 
list of all meaningful urls and to export them through the CL-HTTP export-url 
mechanism. This task can be accomplished, for instance, by a Web walker. Each 
targeted url is remapped automatically as soon as it is scanned by the walker 
(see Figure 1). 

4.3 Basic Mechanism and User Profile 

Trapping the requested user url via the redirect-to-html keyword and redi- 
recting it to WebGuide for further processing is the basic mechanism used for 
harnessing the navigation on the site. In other words, the user url: 

(export-url #u"/cl-http/cem/index.html" 
            :html-computed 
            :authentication-realm :minimum-security 
            :response-function #'redirect-to-html 
            :expiration '(:no-expiration-header)) 

is mapped to the computed url: 

(export-url #u"/cl-http/cem/@index.html" 
            :html-file 
            :pathname "/dev/cem/index.html" 
            :authentication-realm :minimum-security 
            :expiration '(:no-expiration-header)) 

if the current set of constraints is satisfied. If this is the case, the url will be displayed 
in the browser (see Figure 1). 

The enabling of actions and activities for a particular user is made possible 
through the definition and activation of user profiles. Each potential user owns 
a set of properties defining the type of resources he may or may not access. 
Creation of a user profile is done by calling the make-client macro. For instance, 

(make-client :user-name "smith" 
             :password "John" 
             :email-address "smith@inrs-telecom.uquebec.ca" 
             :status "teacher") 

There are currently four categories of users: student, professor, author, and 
administrator, each having its own set of properties. 

4.4 Constraints 

Constraints are instances of constraint types. Constraint types serve as tem- 
plates governing the behavior of constraint instances. They hold general-purpose 
functionality governing their instances while their instances store specializing pa- 
rameters. Circumstance constraint types are special constraints that operate on 
collections of constraints, and thus, accept arguments which are constraint struc- 
tures. Among other things, these constraint types implement logical operations 
and conditional branching over constraint structures. For instance, the following 
macro call: 

    (define-constraint-type
      check-if-professor
      (:url
       :documentation "Shows report page when user is a professor.")
      (constraint activity url user)
      (equalp (client-status (find-client (user-name user))) "professor"))

defines a constraint (called check-if-professor) that is satisfied only when the
current user belongs to the professor category as shown in Figure 2.

Fig. 2. Constraint applied to the professor or student category



4.5 Actions 

Each activity contains a set of actions that are applied to urls that have passed its 
associated constraint set. Actions are instances of action types, whose behavior is 
parameterized by their arguments. Primary actions are the basic kind of action. 
These perform some operation on a url. An example is an action that writes 
HTML describing the current url to the client stream. For instance, the following 
macro call: 



    (define-action-type
      redirect-output
      (:encapsulating
       :class encapsulating-action-type
       :documentation "An action that redirects a page on STREAM.")
      (action activity url stream user)
      (let* ((url-address (url:name-string url))
             (slash (search "/" url-address :from-end t))
             (url-prefix (subseq url-address 0 (1+ slash)))
             (url-suffix (subseq url-address (1+ slash))))
        (setf (client-history (find-client user)) (pushnew url (client-history (find-client user))))
        (redirect-request http::*server* #u(concatenate 'string url-prefix "@" url-suffix))))

defines an action (called redirect-output) that implements the basic redirec- 
tion mechanism (see Figure 3). 
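
The url rewriting at the heart of this redirection can be illustrated outside Lisp. The following Python sketch is only an illustration under stated assumptions: the function name computed_url is hypothetical and not part of WebGuide, and the "@" marker is taken from the computed url /cl-http/cem/@index.html shown earlier.

```python
def computed_url(user_url: str, marker: str = "@") -> str:
    """Insert `marker` in front of the last path component of `user_url`.

    Hypothetical helper: it mirrors the prefix/suffix split around the
    last "/" that the redirect-output action performs on the url string.
    """
    prefix, slash, suffix = user_url.rpartition("/")
    return prefix + slash + marker + suffix


if __name__ == "__main__":
    print(computed_url("/cl-http/cem/index.html"))
```

Keeping the marker as a parameter makes the mapping from user url to computed url explicit and easy to change in one place.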



4.6 Activities 

Activities collect a set of constraints to guide a site walk and a set of actions that 
are performed on visited urls. A site walk is initiated by applying the generic 
function walk to a url and an activity. One can think of an activity as a complex 
argument to a function, containing a number of interrelated parameters that are 
invoked in different ways during a recursive process. Rather than pass all these 
arguments separately, here they are bundled into named objects that can be 
reused. Activities can be defined with define-activity or they can be created 
on the fly with the macro with-activity. In each case, textual specifications 
for constraints and actions are passed to routines that allocate corresponding 
objects used during the walk. For instance, the following code: 

    (defmethod redirect-to-html ((url url:authentication-mixin) stream)
      (let ((realm (url:authentication-realm url))
            (capabilities (url:capabilities url))
            (authorization (get-raw-header :authorization))
            (user (current-user-object)))
        (declare (ignore realm capabilities authorization))
        (or
         (with-activity
             ("display"
              :constraints `((or (check-user-path ,user) (check-if-professor ,user)))
              :actions `((redirect-output () ,stream ,user)))
           (walk url activity))
         (with-activity
             ("explain"
              :constraints `((not (check-user-path ,user)))
              :actions `((signal-error () ,stream)))
           (walk url activity)))))

defines two activities. The first one (display) actually displays the current url
if the user has fulfilled the mandatory class requirements or if he is a teacher.
The second activity explains to the user why the url request has failed.

[Figure: a request for http://adjani.inrs-telecom.uquebec.ca:8000/cl-http/cem/activities.html
is checked against the constraint; "Yes" leads to the redirected url, "No" leads to an
explanation or error message.]

Fig. 3. Basic redirection mechanism

5 Conclusion 

In this paper we have described the re-engineering of a Web walker that uses a set 
of constraints to characterize traversals of Web structures and performs actions 
specified in the action vocabulary. This system, called WebGuide, provides an 
environment for creating and reusing abstractions that constrain the browsing
of regions of any given Web site and perform actions over them. The initial
dictionary provided with WebGuide can be extended to support intelligent agents
performing resource management of Web sites. Future work will address privacy
and information confidentiality, which were not part of the current prototype design.

Abstract. Topic spotting is the task of assigning a category to a document
from among predefined categories. Topic spotting is also called text categorization.
Controlled indexing is the procedure of extracting from a text the informative terms
that reflect its contents. The proposed scheme of topic spotting uses two kinds of
repositories: the integrated repository for controlled indexing and the topic
repositories for topic spotting. A repository is constructed by learning texts, and
consists of terms and their associated information: the total frequency and the IDF
(Inverse Document Frequency). An unknown text is represented as a list of
informative terms by controlled indexing with reference to the integrated repository,
and the category with the largest weight is determined as the topic (category) of the
text. To validate the scheme, news articles from the site “http://www.newspage.com”
are used as examples in the experiment of this paper.



1 Introduction 

Topic spotting is the process of assigning to a document the topic most related to its
contents, from among predefined topics [1]. Topic spotting is identical to text
categorization. A similar task is text routing, the process of retrieving documents
related to the topic or category given as a query [1]. In contrast, topic spotting or
text categorization is the process of retrieving the topic or topics related to the
document given as a query [1].

Research on techniques for automatic topic spotting has progressed steadily. In
1995, Wiener proposed applying the most common neural network model,
backpropagation, to topic spotting in his master's thesis at the University of
Colorado [1]. In his thesis, a text is represented as a feature vector whose features
are the selected terms [1]. In 1995, Yang proposed noise reduction to improve the
efficiency of applying LLSF (Linear Least Squares Fit) to topic spotting. Noise
reduction is the process of eliminating terms that do not represent the contents and
only function grammatically [2]. In 1996, Kalt proposed a new probability model
for text categorization [3]. In this scheme, a text is represented as a feature vector and
the new probability model estimates the probability of each category [4]. Cohen
proposed a hybrid model combining the sleeping experts model and RIPPER, which
considers the context in the text [5]. Lewis proposed the first trainable linear models,
the Widrow-Hoff algorithm and the Exponential Gradient algorithm [5]. Larkey
showed experimentally that the combined model is superior to the individual models
of K-Nearest Neighbor, Bayesian classifier, and relevance feedback [6]. In 1997,
Joachims proposed applying SVM (Support Vector Machine) to text categorization
in order to mitigate the curse of dimensionality [7]. Sahami, Hearst, and Saund
proposed combining a supervised learning model with an unsupervised learning model
for text categorization [8]. In 1999, Yang showed that K-Nearest Neighbor and LLSF are
superior to WORD in the performance of text categorization [9], and applied the
techniques of text categorization to event tracking, which means deciding whether or
not an article focuses on a particular event [10]. In the above schemes of text
categorization, the text is represented as a feature vector whose features are the
selected terms. Such a representation suffers from the curse of dimensionality: too
large a dimension of the feature vector degrades the performance of text
categorization. In order to avoid this problem, Jo proposed that a text be represented
as a list of informative terms instead of a feature vector for text categorization [11].
But this scheme was validated only in a toy experiment, in which the categories were
politics, sports, and business. Its precision reached more than 95% [11]. The scheme
of text categorization proposed in [11] is applied to a function, automatic knowledge
classification, which reinforces the KMS (Knowledge Management System) product
called KWave.

In [11], the terminology was not academic, because the paper was written just after
the text categorization module of the product had been developed. The experiment
was also very small: the predefined categories were politics, sports, and business,
the number of training documents was only 300, and the number of test documents
only 30. In this paper, the terminology for text categorization is made more academic,
and the experiment validating the scheme proposed in [11] is brought closer to a
realistic setting. The number of categories is increased from 3 to 9, the number of
training documents from 300 to nearly 1000, and the number of test documents from
30 to 90. In the terminology for topic spotting, “back data” is changed to “repository”,
“integrated backdata” to “integrated repository”, “categorical back data” to “topic
repository”, and “keyword filtering” to “controlled indexing” [12]. This paper has two
goals: one is to increase the scale of the experiment validating the scheme of topic
spotting proposed in [11], and the other is to make the terminology used in [11] more
academic. This paper is therefore a revised version of [11].

This paper is organized as follows. The next section describes controlled indexing,
and the third section describes the scheme of topic spotting. In the fourth section,
the scheme is validated and its results are presented through the experiment. The
fifth section discusses the significance of this paper and the direction of future
research.

2 Controlled Indexing 

Controlled indexing is the process of representing the given text as a list of
informative terms, that is, terms reflecting its content. The process of document
indexing retrieves a list of the terms included in the text. The terms representing its
contents are then selected based on a particular measurement. The process of
controlled indexing is presented in Figure 1.



[Figure: Text -> Terms]

Fig. 1. This process is the controlled indexing proposed in this paper. 

As presented in Figure 1, the text analyzer represents a text as the list of all terms
contained in it. The term selector then selects the terms that represent its content
sufficiently, with reference to the integrated repository. The integrated repository is
a table consisting of each term, its frequency, and its IDF (Inverse Document
Frequency). It is constructed by learning texts. The texts used for constructing the
integrated repository are called training documents or training texts.
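
A minimal Python sketch of how such a repository could be built from training texts is given here. It is an illustration under stated assumptions, not the implementation used in the paper: the function name build_repository, the whitespace tokenization, and the IDF formula log(N / df) are all assumptions, since the paper does not specify them.

```python
import math
from collections import Counter


def build_repository(training_texts):
    """Return {term: (total_frequency, idf)} learned from training texts.

    Assumptions (not stated in the paper): terms are lower-cased
    whitespace-separated tokens, and idf = log(N / document_frequency).
    """
    total_freq = Counter()  # occurrences of each term over all texts
    doc_freq = Counter()    # number of texts containing each term
    for text in training_texts:
        terms = text.lower().split()
        total_freq.update(terms)
        doc_freq.update(set(terms))
    n_docs = len(training_texts)
    return {
        term: (total_freq[term], math.log(n_docs / doc_freq[term]))
        for term in total_freq
    }


if __name__ == "__main__":
    repo = build_repository(["a b a", "b c"])
    for term, (freq, idf) in sorted(repo.items()):
        print(term, freq, round(idf, 3))
```

The same routine could serve for a topic repository by passing only the training texts labeled with that topic.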



3 Topic Spotting 

This section describes topic spotting, for which the text is represented as a list of
informative terms. The process of topic spotting is presented in Figure 2.

A text is represented as a list of informative terms by the controlled indexing
described in the previous section. The learning process of each topic repository is
identical to that of the integrated repository. The differences from the integrated
repository are summarized in Table 1.

[Figure: Informative Term List -> Topic Spotter (referring to the topic repositories) -> Category or Categories]

Fig. 2. This process is the topic spotting proposed in this paper. 

Table 1. The difference between Integrated and Topic Repository 

                 Integrated Repository          Topic Repository
Training texts   Unlabeled                      Manually labeled
Number           Single                         Number of predefined topics
Function         Selecting informative terms    Assigning probability of each category
Weight           Substantial weight             Categorical weight


As shown in Table 1, the number of topic repositories equals the number of
predefined topics, while there is a single integrated repository. The training
documents for each topic repository should belong to the same topic. In other words,
all training documents for topic repositories should be labeled manually. By referring
to the topic repositories, the categorical weights of each informative term are
computed. The categorical weight measures the degree to which the term reflects the
category. For example, the terms “Clinton”, “President”, or “Minister” have high
categorical weights for politics, because they occur mainly in news articles about
politics. But these terms have low categorical weights for sports, because they occur
very rarely in news articles about sports.



4 Experiment & Results 

This section presents the experiment on topic spotting and the results validating the
proposed scheme. The corpus of this experiment is the set of news articles from May
1999 to 15th July. The web site of the news articles is “http://www.newspage.com”.
The categories of the news articles are as follows.

Business 

Healthcare Regulation 

Migration 

Pharmacy 
Politics 

Public 

Sports 

Wireless communication 
WeirdNuz 

The integrated repository is constructed by learning the training documents: 1000
news articles regardless of their categories (including categories not listed above).
Each topic repository is trained on 100 news articles, so the total number of training
documents for the topic repositories is 900. The number of news articles for testing
is 10 per topic.

The informative terms are selected by ranking terms in descending order of
substantial weight. The rank thresholds are 5, 10, 15, 20, 25, and 30. In this
experiment, the substantial weight of each term is computed by a simple equation.

$$SW(t_i) = F(f_i, tf_i, idf_i) = \frac{[\,\cdots\,]}{tf_i + idf_i} \qquad (5)$$

$$SW(t_i) = F(f_i) = f_i \qquad (6)$$

Note that if there is no identical term in the integrated repository, the equation (6) is 
applied to the computation of the substantial weight of each term. 

For each informative term, the categorical weight is computed by the following
equation.

$$CW_{ij} = F(SW_i, tf_{ij}, idf_{ij}) = tf_{ij} + idf_{ij} \qquad (7)$$

If there is no term in the topic repository identical to an informative term, the
categorical weight, $CW_{ij}$, is 0. In this experiment, only the one category
corresponding to the largest value of probability is assigned to each news article.
The performance of topic spotting is measured by the precision, defined as follows.

$$\text{Precision} = \frac{\text{Number of news articles correctly classified}}{\text{Number of news articles for testing}} \qquad (8)$$
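
The categorical weight of equation (7) and the precision of equation (8) can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function names, the dictionary layout {term: (tf, idf)}, and in particular the rule that a category's score is the sum of the categorical weights of the informative terms are assumptions, since the paper does not state how the weights are aggregated.

```python
def categorical_weight(term, topic_repo):
    """Equation (7): CW = tf + idf, or 0 when the term is not in the repository."""
    entry = topic_repo.get(term)
    if entry is None:
        return 0.0
    tf, idf = entry
    return tf + idf


def spot_topic(informative_terms, topic_repos):
    """Return (best_topic, scores) for one text.

    Assumption: the score of a topic is the sum of the categorical
    weights of the informative terms; the paper only says that the
    category with the largest value is assigned.
    """
    scores = {
        topic: sum(categorical_weight(t, repo) for t in informative_terms)
        for topic, repo in topic_repos.items()
    }
    return max(scores, key=scores.get), scores


def precision(predicted, actual):
    """Equation (8): correctly classified articles / articles for testing."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


if __name__ == "__main__":
    # Toy repositories with made-up (tf, idf) values, for illustration only.
    repos = {
        "politics": {"president": (12, 1.5), "minister": (8, 2.0)},
        "sports": {"goal": (10, 1.0), "president": (1, 3.0)},
    }
    topic, scores = spot_topic(["president", "minister", "budget"], repos)
    print(topic, scores)
```

In the toy example the unknown term "budget" contributes 0 to every topic, as the text prescribes for terms absent from a topic repository.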

The result of topic spotting is presented in Figure 3. 




Fig. 3. The Result of Topic Spotting 

In Figure 3, the x-axis is the number of informative terms and the y-axis is the
precision. Between 10 and 20 terms the precision of topic spotting increases
markedly, while between 20 and 30 terms it increases very little. Figure 3 thus
indicates that the efficient number of informative terms is 20.



5 Conclusion 

The scheme of topic spotting proposed in [11] has been described more formally and
validated in an experiment in which the number of categories is increased from 3 to 9
and the number of test documents from 30 to 90. In the experiment of [11], the
precision of topic spotting was more than 95%. When the domain of the corpus is
extended, the precision is reduced from 95% to 71%. The equations computing the
substantial weight and the categorical weight of each term can be developed in
several ways. It is important to find the optimal equations of substantial weight and
categorical weight, which generate the maximal precision. It is also necessary to
extend the experiment validating the proposed scheme and the comparison with other
techniques of topic spotting. In the future, the proposed scheme will be validated and
compared with other techniques in an experiment using the Reuters-21578 collection
as a test bed. This technique is applied to the development of the topic spotting
component of the KMS (Knowledge Management System) product called KWave.

Abstract. We describe some experimental results within a scenario in 
a simulation framework we are developing to enable experimentation 
with multi-agent behavior, measured by the total utility that agents can 
gather during a given time horizon. In this scenario the population of 
self-centered agents performs in an 80x80 grid with objects carrying in- 
formation (infons) of varying utility that several autonomous agents are 
trying to obtain. This model is an abstraction for a real world informa- 
tion marketplace where agents simply cannot cooperate all the time for 
various practical reasons. The aim of this work is to show how we can 
validate the connection of agent local behavior to global behavior in var- 
ious environmental situations. 

Keywords: multi-agent systems, modeling issues, large-scale agent popu- 
lation, validation by simulation 



1 Introduction 

For self-interested agents, that simply maximize their own utility, it is desirable 
that reasonable local behavior should lead to global reasonable behavior [3,4]. 
But if agents are untruthful or deceitful just to increase their utility by any 
means then harm might arise to the whole society. It is therefore important to 
evaluate basic behavior of large agent societies in assumed environments. Of the 
two forms of cooperation: (i) deliberate and contractual, and (ii) emergent (non 
contractual and even unaware), we are concerned here with the second one. 

The goal to resist exploitation by malevolent agents has seen some results 
in probabilistic reciprocity schemes [7] and prescriptive strategies that promote 
and sustain cooperation among self-interested agents. Exhaustive experiments 
are reported on emergent cooperation in an information marketplace for agents 
acting in various alternatives of the iterated prisoner’s dilemma [1]. 

For modeling multiple agents under uncertainty we have chosen ICL (in- 
dependent choice logic) [6]. Inspired by decision/game theory, it constitutes a 
solid foundation for evaluating multi-agent behavior over a time horizon. Our 
simulation framework materializes this approach. 

We have constructed a multi-agent model to experiment with how a large number
of agents behave. As an illustrative scenario, a world resembling an information
marketplace has been implemented. The agents in this scenario cooperate in a
random fashion, capturing in this way the real world where, for various reasons,
agents are not always able to cooperate, even when trying hard. By simulating
various kinds of encounters between the agents we expect to evaluate those
strategies that are favorable for particular agents and also how to counteract
unacceptable behavior of some agents in a more realistic world.



2 Agent Model 

For the scenario of the information marketplace we use the theory of practical
rationality with dynamic obligation hierarchies [2] to describe what agents have
to do. A statement of the form $Pref(a, \phi, \psi)(i)$ says that agent $a$ prefers
$\phi$ to $\psi$ for the interval $i$. $PFObl(a, \phi)(i)$ states that $\phi$ is among
$a$'s prima facie obligations for interval $i$.

$Pref(a, PFObl(a, \phi), PFObl(a, \psi))(i)$ says that, for interval $i$, $\phi$ is a
more important prima facie obligation for agent $a$ than $\psi$. A realisable prima
facie obligation for an agent $a$ for interval $i$ is a prima facie obligation of $a$'s
for interval $i$ which $a$ can realize on interval $i$.

The agent always has a set of obligations with preferences, defining a partial
order over the set of obligations. A pair of obligations for which no preference is
given is considered indifferent. An agent $a$ prefers $\phi$ to $\psi$ during the time
interval $[i_1, i_2]$ if the agent $a$ prefers $\phi \wedge \neg\psi$ to
$\psi \wedge \neg\phi$ during the time interval $[i_1, i_2]$. Obligations define the
way the agent should behave in all situations. An obligation is realisable if there is
some plan $e$ by means of which $a$ can achieve $\phi$ for $i$. At any given time
step some of the obligations will seem realizable, even if they are not. Due to the
uncertainty of the environment, estimating realizability does not yield a unique
value. In this way, the realizability of an obligation acts as a filter in generating the
set of realisable prima facie obligations. Of all the obligations in the hierarchy, the
most preferred is selected for fulfillment.



2.1 Conceptual Agent 

The scenario we have chosen to study the strategy of agents is a grid world 
with several agents trying to collect infons that appear and disappear. Infons 
are carriers of information characterized by its utility value varying in the range 
1 to 4. The agents try to maximize the global utility value of all infons they can 
collect. A sample showing possible changes in the grid world for time t=0,1,2,3 
is illustrated in figure 1. 

All infons appear randomly, one in a square of the grid, with a life span 
described by the sequence 

Fig. 1. Sample of grid world at time (a) t=0, (b) t=1, (c) t=2 and (d) t=3. 



for infon $i_k$, that is its utility increases in time up to the maximum and then 
decreases again until 1, and in the next time step it disappears. Every self- 
interested agent A tries to get infons where j is the utility of the infon, to 
increase its own overall utility as much as possible. We can see in figure 1 that 
agent A steps westwards at time step 0 to get the infon $i_1$ at time t=1 when 
adjacent to it. Agents prefer collecting infons with higher utility value, 

$$utility(i_k) > utility(i_j) \Rightarrow Pref(a, get(i_k), get(i_j))$$

but it is their overall utility that they try to maximize. 

For a given state of the infon world an agent can establish which infons it 
can collect and which not. Some preferences of the agent A in the states shown 
in figure 1 are as follows. 

$Pref(A, PFObl(get(i_2)), PFObl(get(i_1)))[0..2]$
$Pref(A, PFObl(get(i_2)), PFObl(get(i_3)))(0)$
$Pref(A, PFObl(get(i_3)), PFObl(get(i_2)))[1..3]$
$Pref(A, PFObl(get(i_2)), PFObl(get(i_4)))[0..1]$
$Pref(A, PFObl(get(i_4)), PFObl(get(i_2)))[2..3]$

The preferences on prima facie obligations, the realisable prima facie 
obligations and those that have been realized over the interval [0..3] by agent 
A of figure 1 are shown in table 1. Here we can see the change of preferences 
in time and also the change of their realizability. Our agent A has been able to 
collect infons if and i^ both with utility value 2. 

3 Simulation Framework 

Various simulation tools have been developed for the study of multi-agent
behavior, including coordination and survivability [8]. Our simulation framework
is developed with Swarm [5], extended with a choice space and a utility space
to observe global agent behavior [6].

Table 1. Agent preferences, realizable and realized obligations over the interval [0..3]. 

3.1 Evaluation of Agent Behavior 

The expected utility for agent $a$ under strategy profile $\sigma$ over an entire time
horizon is [6]

$$\varepsilon(a, \sigma) = \sum_{\tau} p(\sigma, \tau)\, u(\tau, a)$$

where $p(\sigma, \tau)$ is the probability of the world $\tau$ under strategy profile
$\sigma$ and $u(\tau, a)$ is the utility of the world $\tau$ for agent $a$. The strategy
profile $\sigma$ depends on the choices made by the agent $a$ according to its
obligations and also its perception of the current world.
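
As a worked illustration of this expected utility, the following Python sketch sums probability-weighted utilities over a finite set of possible worlds. The data layout (a list of pairs of probability and per-agent utilities) and the function name expected_utility are assumptions made for the example; they are not taken from the simulation framework.

```python
def expected_utility(worlds, agent):
    """Sum p(world) * u(world, agent) over all possible worlds.

    `worlds` is a list of (probability, {agent_name: utility}) pairs that
    are assumed to result from one fixed strategy profile.
    """
    return sum(prob * utilities[agent] for prob, utilities in worlds)


if __name__ == "__main__":
    # Three hypothetical worlds whose probabilities sum to 1.
    worlds = [
        (0.5, {"a": 4, "b": 1}),
        (0.3, {"a": 2, "b": 3}),
        (0.2, {"a": 0, "b": 2}),
    ]
    print(expected_utility(worlds, "a"))
    print(expected_utility(worlds, "b"))
```

For agent a the sum is 0.5 × 4 + 0.3 × 2 + 0.2 × 0 = 2.6, up to floating-point rounding.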

3.2 Experimental Agent 

The experimental environment consists of several infons with varying utility 
value on an 80x80 grid. An agent can collect an infon when it is situated in a 
square adjacent to its own square. All agents can see just a part of the entire 
grid world given by the parameter 5, representing the number of squares in 
all directions (North, South, East, West). The agent always executes the most 
important goal among its set of goals: 

    pref(A,O1,O2) :- pfobl(A,O1), pfobl(A,O2), prefer(A,O1,O2).
    prefthan(A,V,X,Y,L) :- findall(O, pref(A,O,o(V,X,Y)), L).
    noreallist(_,[]).
    noreallist(A,[H|T]) :- not(real(A,H)), noreallist(A,T).
    mostpref(A,o(V,X,Y)) :- pfobl(A,o(V,X,Y)), real(A,o(V,X,Y)),
                            prefthan(A,V,X,Y,L), noreallist(A,L).

where o(V,X,Y) is an obligation to collect the infon with current utility V, 
on the grid with current coordinates (X,Y). The realizability of an obligation 
is currently calculated by a pessimistic estimation of the evolution of infons. 
Agents assume that infon evolution is linear, although in fact this could be in 
steps. 
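
The selection rule expressed by mostpref can be paraphrased in Python as a sketch: an obligation is most preferred when it is realizable and no obligation preferred to it is realizable. The function name most_preferred, the tuple encoding (V, X, Y) of obligations, and the example preference relation (higher utility is preferred) are assumptions for illustration only.

```python
def most_preferred(obligations, prefers, realizable):
    """Return the realizable obligations that no realizable obligation beats.

    obligations: iterable of (utility, x, y) tuples (prima facie obligations)
    prefers(p, o): True when obligation p is preferred to obligation o
    realizable(o): True when the agent believes o can be realized
    """
    obligations = list(obligations)
    selected = []
    for o in obligations:
        if not realizable(o):
            continue
        # Counterpart of prefthan/5: everything preferred to o.
        better = [p for p in obligations if p != o and prefers(p, o)]
        # Counterpart of noreallist/2: none of those may be realizable.
        if not any(realizable(p) for p in better):
            selected.append(o)
    return selected


if __name__ == "__main__":
    infons = [(4, 10, 10), (2, 1, 1), (3, 2, 2)]
    reachable = {(2, 1, 1), (3, 2, 2)}  # hypothetical realizability estimate
    print(most_preferred(infons,
                         prefers=lambda p, o: p[0] > o[0],
                         realizable=lambda o: o in reachable))
```

In the example the infon with utility 4 is preferred but not realizable, so the agent settles on the infon with utility 3.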

An agent step is achieved in five stages: 

- Prima Facie Obligation, or what the agent should do: what the agent can
see with a given vision (number of squares from its current position), that
is, the obligations imposed on the agent.

- Preferences, or the order the agent wants on tasks: a partial ordering
relation on the infons seen by the agent. This relation makes use of the
current utility of the infons and their position on the grid.

- Realizable, or the tasks the agent believes can be realized: the subset of
realizable obligations, that is, there is a free path to the infon and the infon
will not disappear from the grid in that time.

- Obligation, or the goal of the agent at a given step: the most preferred
obligation in the set of realizable obligations.

- Action, or how the agent fulfills the current goal: the action that the agent
will execute to fulfill the most preferred obligation.

These stages can be visualized to allow the designer to eventually analyze how 
local behavior influences global behavior. An agent cycle in its interaction with 
the world is achieved in three execution phases: (i) perception, (ii) reasoning and 
(iii) acting, during which the world does not change. 

4 Simulation Results 



[Plots omitted. Panel title: "Experiments with different duration"; x-axes: number of agents, vision.]



Fig. 2. (a) Agent behavior vs. number of agents (b) Agent behavior vs. vision. 



Initially the agents are distributed at random on the grid. Figure 2 (a) shows
the influence of vision on the average utility collected by an agent.
Each experiment was carried out with a varying number of infons and agents. 




Validating the Behavior of Self-Interested Agents 397 



We notice a threshold from which average utility increases very slowly with the 
increase in vision. 

Figure 2 (b) shows how the average utility collected by an agent varies with the
number of agents on the grid. Both the number of infons and the vision of agents
were varied. We notice significant variation when the number of agents is small,
when the conflicts on obligations are not significant. When the number of agents
is increased, the number of conflicts increases and the agents have difficulty in
resolving conflicts. This indicates a threshold where more coordination will prove
beneficial.

5 Conclusions 

Instead of just designing good social laws, strategies for interactions with other 
agents that can promote and sustain cooperation among self-interested agents 
have already been reported [7]. Simulation was also used to provide ability and 
flexibility when modeling complex interaction among heterogeneous agents [8]. 
Our abstract scenario for the information marketplace can easily be extended 
with other parameters describing in a concise manner the real world. The sim- 
ulation framework we are developing is more general as it allows various logical 
specifications of agents to be included and experimented with. Our next step 
will be inclusion of various kinds of agents defined in the literature and the 
measurement of their performance in terms of individual and overall utility. 

Abstract. In human-computer interaction, user interface events can be recorded
and organized into sequences of episodes. By computing their implication
networks, episode frequencies, and some heuristic measures of interestingness,
we can readily derive application-specific episode association rules. To
demonstrate the proposed method, we have developed a personalized
interface agent that takes interface events into consideration when analyzing
user goals. It can then act on behalf of the user, interacting with the
software based on the recognized plans. To adapt to different users'
needs, the agent personalizes its assistance by learning user profiles.
Currently, we have used Microsoft Word as a test case. By detecting and
analyzing the patterns of user behavior in using Word, the agent can 
automatically assist the users in certain Word tasks. The pattern association can 
be achieved at several levels, i.e., text-level (phrase association), paragraph- 
level (formatting association), and document-level (style and source 
association). 



1 Introduction 

The ease with which a software system can be effectively operated by users is to a 
large extent determined by the design and complexity of a user interface. This paper 
explores the application of an interface agent that records the events of human- 
computer interaction (HCI) and discovers the consistent patterns of user behavior. 
Thereafter, the agent can provide just-in-time assistance to a user by predicting the
most likely plan of the user and carrying out part of the plan on the user's behalf.

The advantages of incorporating an agent in the user interface are: (1) the interface 
is no longer static as it reacts to different situations and requirements, (2) the interface 
is seamlessly personalized as it learns the behavior patterns as well as styles of 
individual users, and (3) the software system can be manipulated in a semi- 
autonomous manner that significantly reduces the amount of intervention required. 

Earlier examples of user interface agents include Letizia and Let’s Browse. 
These agents assist a user in browsing the World Wide Web by tracking user behavior 
and anticipating items of interest. These systems analyze user behavior by means of 
matching the keywords in the Web documents, whereas in our case, the agent will 
recognize user action plans by tracing and analyzing the action sequences of the user. 

Our approach discovers the rules that can best describe and predict user behavior by
finding frequently occurring episodes in user action sequences. Such an approach is
inspired by the earlier work on Window Episode (WINEPI) and Minimal Occurrence
Episode (MINEPI). In order to reliably detect patterns from a limited number of
observed sequence data samples, we employ a method of inducing implication
networks. This method constructs a dependence relationship between two
ordered events based on statistical testing.



2 Problem Statement 

In our work, we regard the events of a user interface as episodes $E_i$, and capture all
the sequences of episodes $S_i$ as a user is interacting with an application through the
interface. Therefore, we can reduce the problem of finding user patterns in HCI to
that of discovering frequent episodes $\alpha_i$ out of the sequences of events $S_i$.

Figure 1 presents a schematic diagram of the above problem: given the sequences
of events $S_1, S_2, \ldots$, find some significant frequent episodes $\alpha_i$ by
applying statistical tests to the sequences. In the figure, the frequent episodes are
$\alpha_1$: $E_1 \Rightarrow E_2$, which expresses a co-occurrence in which event
$E_1$ will be followed by event $E_2$, and $\alpha_2$: $E_3 \Rightarrow E_5 \Rightarrow E_2$.

Fig. 1. Overview of the problem 

3 Discovering User Behavior Patterns 

In what follows, we describe how to find and use consistent behavior patterns in 
HCI. Throughout our descriptions, we use MSWord as our test case to illustrate 
how our approach works. 

Specifically, we develop a capability of Episode Identification and Association 
(EIA) in our user interface agent, which is called the Personalized Word Assistant 
(PWA). Figure 2 provides an overview of PWA. 

3.1 Window Episode (WINEPI) 

In HCI, an event $E$ is a pair $(A, t)$ of an action type and an occurrence time. A sequence $S$ is 
defined as $(s, T_s, T_e)$, where $s$ is an event sequence, $T_s$ is the starting time, and $T_e$ is the 
ending time. In order to discover frequent episodes from an event sequence, a time 
window is defined that divides a long sequence into a number of shorter 
sequences. Each window is also regarded as a set $W$ of events, defined as 
$(w, t_s, t_e)$, where $w$ is an event sequence, $t_s$ is the starting time, and $t_e$ is the ending time. 
The time difference $t_e - t_s$ is called the width of the window $W$ and is 
denoted by width($W$). 



The frequency of an episode is defined as the fraction of windows in which the 
episode occurs. That is, given an event sequence $s$ and a window width limited by 
$\mathit{win}$, the frequency of an episode $E$ in $s$ is based on the count: 

$$\mathrm{fr}(E, s, \mathit{win}) = |\{w \in W(s, \mathit{win}) \mid E \text{ occurs in } w\}| \tag{1}$$



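To qualify as a frequent episode, an episode has to pass two tests: one is a frequency 
threshold test and the other is a confidence threshold test, given below, respectively: 

$$\frac{\mathrm{fr}(E_i \Rightarrow E_j, s, \mathit{win})}{\mathrm{fr}(E_i, s, \mathit{win})} > \mathit{min\_fr} \tag{2}$$

$$\frac{\mathrm{fr}(E_i, s, \mathit{win})}{|W(s, \mathit{win})|} > \mathit{min\_conf} \tag{3}$$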
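The window count and the two ratios on the left-hand sides of tests (2) and (3) can be sketched as follows. This is a minimal Python sketch, not the PWA implementation: it assumes integer time stamps, windows that slide one time unit at a time, and serial episodes whose events need not be adjacent. The helper names `windows`, `occurs`, `fr`, and `threshold_ratios`, as well as the sample log, are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple

Event = Tuple[str, int]  # (action type, occurrence time); events are sorted by time


def windows(events: Sequence[Event], t_start: int, t_end: int, win: int) -> Iterator[List[str]]:
    """Yield the action types seen in each window [start, start + win).

    Assumption: the sequence covers [t_start, t_end) and windows slide by one
    time unit, starting early enough that the first window holds only the
    first time point.
    """
    for start in range(t_start - win + 1, t_end):
        yield [a for a, t in events if start <= t < start + win]


def occurs(episode: Sequence[str], window: Sequence[str]) -> bool:
    """True if the events of `episode` appear in `window` in the given order."""
    it = iter(window)
    return all(any(a == e for a in it) for e in episode)


def fr(episode: Sequence[str], events: Sequence[Event], t_start: int, t_end: int, win: int) -> int:
    """Number of windows in which the episode occurs."""
    return sum(occurs(episode, w) for w in windows(events, t_start, t_end, win))


def threshold_ratios(e_i: str, e_j: str, events: Sequence[Event], t_start: int, t_end: int, win: int):
    """Left-hand sides of tests (2) and (3) for the episode E_i => E_j."""
    n_windows = t_end - t_start + win - 1
    fr_i = fr((e_i,), events, t_start, t_end, win)
    fr_ij = fr((e_i, e_j), events, t_start, t_end, win)
    return (fr_ij / fr_i if fr_i else 0.0), fr_i / n_windows


# Hypothetical event log: five interface events at times 1..5.
log = [("E1", 1), ("E2", 2), ("E3", 3), ("E1", 4), ("E2", 5)]
ratio_2, ratio_3 = threshold_ratios("E1", "E2", log, t_start=1, t_end=6, win=2)
```

The two returned values would then be compared against whatever thresholds the agent is configured with.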



3.2 Implication Relations 

Sometimes, an interface agent may need to discover frequent episodes with limited 
information. In order to deal with this situation, we utilize an implication induction 
algorithm [5, 6]. For each implication relation $E_i \Rightarrow E_j$, this algorithm computes the 
lower bound of a $(1 - \alpha_c)$ confidence interval around the measured conditional 
probability. Suppose that there have been $N_{error}$ observations that violate 
$E_i \Rightarrow E_j$. Then, based on the binomial distribution $\mathrm{Bin}(N, p_{min})$, it tests whether or 
not the probability of errors is less than a threshold, that is: 

$$P(x \le N_{error}) < \alpha_c, \tag{4}$$

where $\alpha_c$ is the alpha error of this conditional probability test. If $X$ is the frequency of 
the occurrence, then $X$ follows a binomial distribution, whose probability function 
$p_X(k)$ and distribution function $F_X(k)$ are given below: 

$$p_X(k) = P(X = k) = \binom{N}{k} \, p_{min}^{\,k} \, (1 - p_{min})^{N-k} \tag{5}$$

$$F_X(k) = P(X \le k) = \sum_{i=0}^{k} \binom{N}{i} \, p_{min}^{\,i} \, (1 - p_{min})^{N-i} \tag{6}$$
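Test (4) can be checked numerically with the binomial distribution function. The sketch below is a minimal Python illustration that reads inequality (4) literally; the function names and the sample numbers are hypothetical, and the distribution parameters are passed in by the caller rather than fixed.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k occurrences in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k: int, n: int, p: float) -> float:
    """Probability of at most k occurrences in n trials."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


def passes_test(n: int, n_error: int, p: float, alpha_c: float) -> bool:
    """Inequality (4): P(x <= n_error) < alpha_c under Bin(n, p)."""
    return binom_cdf(n_error, n, p) < alpha_c


# Hypothetical numbers: 20 observations, of which 1 or 10 violate the relation.
few_errors = passes_test(n=20, n_error=1, p=0.5, alpha_c=0.05)
many_errors = passes_test(n=20, n_error=10, p=0.5, alpha_c=0.05)
```

With these sample values the relation is kept when only one observation violates it and rejected when half of them do.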



3.3 Phrase Association 

The first level of assistance in the implemented PWA agent is called phrase 
association. At this level, PWA considers one or more words as an episode and a sequence of 
words as an event sequence. With a bag-of-words text document representation, it 
readily finds the frequency counts of the words (episodes). By using the above-mentioned 
algorithms of EIA, it discovers various associations among words. An episode of 
one word is called a 1-gram, and an episode of two words is called a 2-gram. 
The n-gram representation can consider up to five words in an episode. The PWA 
agent provides assistance to a user whenever it detects a phrase association. 
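As an illustration of the n-gram episodes described above, the following minimal Python sketch counts all 1-grams to 5-grams in a word sequence and proposes the most frequent one-word continuation of what the user has typed. It uses plain frequency counts only and leaves out the statistical tests of EIA; `ngram_counts` and `suggest` are hypothetical names.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def ngram_counts(words: Sequence[str], max_n: int = 5) -> Counter:
    """Count every n-gram (as a tuple of words) for n = 1..max_n."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def suggest(prefix: Sequence[str], counts: Counter) -> Optional[str]:
    """Return the word that most often follows `prefix`, or None if unseen."""
    head: Tuple[str, ...] = tuple(prefix)
    candidates = {
        gram: c for gram, c in counts.items()
        if len(gram) == len(head) + 1 and gram[:-1] == head
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)[-1]


# Hypothetical text previously written by the user.
history: List[str] = "data mining and data mining tools".split()
counts = ngram_counts(history)
next_word = suggest(["data"], counts)
```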



3.4 Format Delegation 

The second level of EIA is called format delegation. This level concentrates on 
the delegation of formatting and finds consistent formats in paragraphs. When a user 
changes a certain part of a paragraph format and continues to perform the same 
operations elsewhere, the PWA agent detects a format change pattern between $F_i$ 
and $F_j$, i.e., it discovers a frequent episode $F_i \Rightarrow F_j$. With the detected frequent 
episodes, the agent can automatically take over the task of applying the consistent 
format changes to other paragraphs. 

3.5 Document Style and Source Recommendation 

The third level of assistance is called document style and source recommendation, 
which is concerned with discovering relevant styles and sources for a certain 
document. Among various documents, there may be many different styles. The PWA 
agent first categorizes the documents into different styles. That is, the documents with 
similar styles are clustered and a template style is created. Thus, whenever 
the PWA agent detects that a user is about to create a document of a certain style, it can 
automatically make style template recommendations. 

At the same time, document source recommendation locates and suggests various 
relevant document sources to the user. Whenever the user is writing a document, 
the agent searches both local (file system) and global (WWW) sources for related 
contents. Both searches apply the same methods of determining which documents 
are to be associated, that is, document categorization based on feature selection [7, 8, 9, 
10] and text weighting based on Term Frequency and Inverse Document Frequency 
(TFIDF) [11]. 
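A small sketch of TFIDF weighting may help here. The source does not say which TFIDF variant is used, so the Python code below assumes one common form, raw term frequency multiplied by log(N / document frequency); the function name `tfidf` and the sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} dictionary per tokenized document.

    Assumed variant: weight = tf * log(N / df), where tf is the raw count of
    the term in the document and df the number of documents containing it.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


# Hypothetical candidate documents, already split into terms.
weights = tfidf([["agent", "interface"], ["agent", "search"]])
```

A term that appears in every document, such as "agent" here, receives weight zero, so only the distinguishing terms contribute when documents are compared.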



4 Experimentation 

In order to validate the effectiveness of the PWA agent in offering personalized 
interface assistance, we have designed and conducted two experiments, both 
involving real users handling real documents. 


4.1 Experiment 1 

In this experiment, one group of 10 users participates in the test. They are 
asked to write documents with the help of the PWA agent. First, the users 
decide which kinds of documents will be written in the experiment and, at the same 
time, provide some relevant documents that they have written 
before. After the users have provided such information, PWA 
tries to build user profiles for the individual users at the first two levels described 
above. Thereafter the experiment starts and the users begin to write their documents. 
During the experiment, we record the number of suggestions offered and the number 
of acceptances. The averages are shown in Table 1. Both levels of suggestions have 
a quite high percentage of correctness: 75% for phrase association and 86% for format 
delegation. 





                                  Phrase association    Format delegation 
Average number of suggestions     175.14                83.24 
Average number of acceptances     131.14                71.49 

Table 1. Results of phrase association and format delegation 



4.2 Experiment 2 

In the second experiment, two groups of users ($G_1$ and $G_2$) participate in 
the test. One group of 5 users ($G_1$) writes a document with the help of the PWA 
agent, and the other group of 5 users ($G_2$) writes the document without the 
assistance of the agent. There are two sets of documents ($D_1$ and $D_2$) for the test; the 
contexts of the documents are very similar. First of all, the agent prepares user profiles 
for the $G_1$ users by storing and learning their related documents. The $G_1$ users then 
write documents with the PWA agent, while the $G_2$ users write by themselves 
without the help of PWA. Similar to Experiment 1, the total processing time and the 
number of operation steps are recorded. The results of the two groups are shown in 
Table 2. $G_1$ (with PWA) performed better than $G_2$ (without PWA) on both measures. 

                                      $G_1$ (with PWA)             $G_2$ (without PWA) 
Averaged total processing time        8 minutes and 56 seconds    11 minutes and 59 seconds 
Averaged number of operation steps    253                         401 

Table 2. Results of $G_1$ and $G_2$ 



5 Conclusion 

In this paper, we have described an interface agent that records the events of human- 
computer interaction (HCI) and discovers the consistent patterns of user behavior. As 
experimentally validated in the case of MSWord, the interface agent can effectively 
carry out the different levels of Episode Identification and Association (EIA), and 
thereafter, provide just-in-time assistance to users by predicting the most likely plan 
of the user and carrying out part of the plan on behalf of the user. 

Abstract. The Internet Search Service (ISS) was proposed to support a 
uniform interface for searching on the World Wide Web. Based on this service, 
a multi-search engine named Octopus has been built. In order to provide more 
services, such as personal functionality, we provide personalized search for 
users. In this paper, the policies of personalized search are described. In 
addition, in order to keep the advantages of the ISS, a personal information-filtering 
agent is added to Octopus instead of modifying the architecture 
or interface of ISS. The feedback mechanism cooperates with the filtering 
mechanism to achieve the functionality of personalized search in a search 
engine. 



1. Introduction 

Most search engines and multi-search engines [1, 2] are developed only for 
WWW users, not for application programs that need to exploit data from the web. 
They also lack a uniform interface for accommodating new and powerful 
search engines in the future, so most multi-search engines have limited extensibility. 

We have proposed a uniform interface, the Internet Search Service (ISS) [3], which 
follows the COSS of OMG's CORBA [4], to solve the problem described above. An 
experimental ISS-based multi-search engine termed Octopus has been built. With 
ISS, Octopus can accommodate new search engines easily and support application 
programs that exploit data from the Internet. 

Most results returned by search engines may not be useful to users even if 
they are ranked highly. Search engines with a personalized search function 
are strongly needed by experts. Some well-known search 
engines, such as SavvySearch and MyYahoo, already support this function. The main 
purpose of personalized search is to offer the most suitable query results to the user. 

This paper describes the design and implementation of personalized 
search in Octopus. In Octopus, an absolutely irrelevant filtering approach is used to 
support personalized search. 

In order to balance system load and user requirements, the filtering mechanism is 
divided into three levels: URL, Description/Context, and Content. In 
addition, the feedback mechanism cooperates with the filtering mechanism to 
achieve the functionality of personalized search in a search engine. 

This kind of search service is intended for WWW users, not for application 
programs. Therefore, this function is independent of the ISS. Supporting such a service 
only requires redesigning the architecture of Octopus. The original advantages of ISS 
should be preserved when other functionality is provided, and none of the interfaces 
is modified. The major contribution of the paper is an approach to supporting the 
personalized search service based on the ISS without changing the interface of 
Octopus. 



2. Related Works 

A variety of search tools, such as Excite, Lycos, and MyYahoo, offer means of 
personalizing or customizing their sites for the individual user. The advantage 
of these tools is that, like "push" services, they provide you with up-to-date information 
tailored to your desires with little ongoing active effort on your part. But how much 
information do you want to reveal about yourself? 

A famous multi-search engine that supports optimal search is SavvySearch [2]. It 
automatically tracks the effectiveness of each search engine in responding to previous 
queries and creates a meta-index for future queries to decide which search engines are 
more adequate. Because one of the major factors in adjusting the meta-index is the 
number of visitors, a search plan made according to the meta-index is not 
well suited to individual users. 

Amalthaea [6] is a multi-agent information filtering and discovery system. In the 
system, the information discovery agents refer to their history logs and check whether the 
information-filtering agent that has been the most profitable is making a request. If not, 
the discovery agent proceeds to the next preferred filtering agent and checks again. If so, 
that request is selected. Amalthaea shares the drawback of SavvySearch. 

Many other researchers have proposed architectures to support 
personalized search [5, 7]. These solutions are based on proprietary techniques that are 
not easy for other application programs to apply. 



3. Personalized Search Supports 

The ISS is designed in the style of CORBA's COSS. Its major goal is 
to provide a uniform interface for most search engines. For the details of ISS and the 
Octopus scenario, please refer to [3]. In this section, we describe the personalized 
search support in Octopus. 

3.1 Design principle 

To keep the advantages of the ISS, a personal information-filtering agent is added 
to Octopus instead of modifying the architecture or interface of ISS when adding 
personal functionality. Figure 1 shows the preliminary design of Octopus with 
personalized search. The major difference between this architecture and the original 
Octopus is the addition of a personal information-filtering agent that filters 
results according to user preference. This design philosophy aims to reduce the search 
overhead when similar queries are requested repeatedly. 

The feedback mechanism cooperates with the filtering mechanism to achieve 
the functionality of personalized search. The latter is used to find adequate results, 
while the former lets the user indicate his/her preferences. In this paper, the two 
mechanisms adopt the implicit feedback approach and the absolutely irrelevant 
filtering approach, respectively. 

An implicit feedback approach is used instead of an explicit one so that it works 
properly with the filtering mechanism. The absolutely irrelevant filtering approach is 
based on users' habits when searching for information among a large number of URLs and 
descriptions. In general, users first visit the URLs they deem more suitable 
and skip the others, which they regard as irrelevant. A visited web page can therefore be 
taken as interesting to the user to some extent. Analyzing the fully irrelevant URLs or 
descriptions may reveal more about what the user does not want than a relevance-based 
approach reveals about what he/she wants, and the result can act as the filtering basis. 
This is the spirit of the absolutely irrelevant filtering approach. 




[Figure: The personalized search service model] 

Figure 1. The preliminary architecture for supporting personalized search 

We can use the classical Boolean model of Information Retrieval [8] to explain the 
concept. Let $K_r$ be the set of relevant terms and $K_i$ the set of non-relevant terms. 
Then $K = K_r \cup K_i$ is the set of all terms that are included in the returned results or 
documents, and $K - K_i$ contains the more relevant terms. Thus, using the absolutely 
irrelevant filtering approach to filter out the non-relevant terms yields the more relevant 
terms. 

In the design, the personal information-filtering agent analyzes the factors of these 
non-relevant URLs or web pages and stores them in a knowledge base for future 
search requests. The knowledge base keeps personal filtering information. When 
the Search Engine Agent replies to a user request, the Information-Filtering Agent 
filters the result in accordance with the personal filtering information. 
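In set terms, the Boolean-model argument above is a plain set difference. The following lines are a minimal Python sketch with hypothetical term sets; `more_relevant_terms` is an illustrative name, not part of Octopus.

```python
from typing import Iterable, Set


def more_relevant_terms(all_terms: Iterable[str], non_relevant_terms: Iterable[str]) -> Set[str]:
    """Remove the terms learned as non-relevant from the set of all terms."""
    return set(all_terms) - set(non_relevant_terms)


# Hypothetical terms taken from one page of returned results.
K = {"spa", "hotel", "casino", "travel"}
K_non_relevant = {"casino"}  # learned from results the user skipped
kept = more_relevant_terms(K, K_non_relevant)
```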



3.2 Level of Filtering 

The filtering mechanism is the core of personalized search support. In order 
to balance system load and user requirements, the filtering mechanism is divided into 
three levels: URL, Description/Context, and Content. 

1. URL: This is the simplest filtering level. It stores in the personal database the 
   URLs of results that were returned to the user but not visited. The non-visited 
   URLs are treated as absolutely irrelevant and act as the filtering base for future 
   search requests. Because this level is the simplest, it has the least overhead. 

2. Description/Context: Almost all results returned by search engines consist of a 
   URL and a description. At this level, the filtering mechanism analyzes the vocabulary 
   in the descriptions of all the non-visited web sites, excluding stop terms¹, 
   and applies the index model [8] to create an index for each term. When the 
   index of a certain term exceeds the pre-determined cutoff threshold, the 
   filtering mechanism adds it to the list of filtering terms and stores the 
   analyzed results in the filtering base. When a user issues a search request 
   with this filtering level, the Information Filtering Agent uses the 
   filtering base to filter the query result and discard the irrelevant results. 

3. Content: The same technique as at the second level is applied at this level, except 
   that the analyzed target is the full content of the web site. Because the amount 
   of analyzed data is the largest of the three levels, the overhead is also the largest. 

All filtering levels are based on the implicit feedback mechanism, which feeds back the 
selected web sites implicitly. A detailed description is given in the next subsection. 
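The first two filtering levels can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Octopus code: results are dictionaries with `url` and `description` keys, the index of a term is approximated by its raw count over the skipped descriptions (the index model of [8] is not reproduced), and the stop-term list and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set

Result = Dict[str, str]  # {"url": ..., "description": ...}

STOP_TERMS = {"a", "an", "the", "is", "are", "be", "of", "to", "and", "it"}  # illustrative only


def non_visited(results: Iterable[Result], visited_urls: Set[str]) -> List[Result]:
    """Implicit feedback: results shown to the user but never visited."""
    return [r for r in results if r["url"] not in visited_urls]


def build_filtering_terms(skipped: Iterable[Result], cutoff: int) -> Set[str]:
    """Level 2: terms whose count over skipped descriptions exceeds the cutoff."""
    counts = Counter(
        term
        for r in skipped
        for term in r["description"].lower().split()
        if term not in STOP_TERMS
    )
    return {term for term, c in counts.items() if c > cutoff}


def apply_filter(results: Iterable[Result], blocked_urls: Set[str], filtering_terms: Set[str]) -> List[Result]:
    """Discard results that match the URL base (level 1) or the term base (level 2)."""
    kept = []
    for r in results:
        terms = set(r["description"].lower().split())
        if r["url"] in blocked_urls or terms & filtering_terms:
            continue
        kept.append(r)
    return kept


# Hypothetical session: the user visits only the second result.
first_page = [
    {"url": "a.example/1", "description": "cheap casino bonus"},
    {"url": "b.example/2", "description": "hot spring spa guide"},
    {"url": "c.example/3", "description": "casino games online"},
]
skipped = non_visited(first_page, visited_urls={"b.example/2"})
blocked = {r["url"] for r in skipped}
terms = build_filtering_terms(skipped, cutoff=1)

later_page = [
    {"url": "d.example/4", "description": "casino hotel"},
    {"url": "e.example/5", "description": "spa hotel"},
    {"url": "a.example/1", "description": "spa"},
]
filtered = apply_filter(later_page, blocked, terms)
```

The content level would reuse `build_filtering_terms` on full page text instead of descriptions, at a correspondingly higher cost.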



3.3 Architecture of Personalized Search 

Figure 2 shows the detailed architecture supporting the functionality of 
personalized search in Octopus. Based on ISS, some components are added to the 
system to support this function, such as the User Profile, Filtering Database, Feedback 
Mechanism, and Result Processing Mechanism. The system scenario is as follows. 
The system first checks the user identifier through the User Profile. Once the user 
passes the check, the system generates a query page for the user to post the query 
string and waits for a query request. Then, the Result Filter looks for the query result 
in the Result Cache. If the expected information is missing, the Query Page Generator 
submits the request to ISS's mediator to search for new information and gets the 
result through the Result Aggregator. The Result Filter then filters the results based 
on the information in the personal profile and filtering database and passes the filtered 
results to the user. Once the user receives the results, the Feedback Mechanism is 
activated to monitor the user's feedback. 

¹ Stop terms are vocabulary that carries little content, such as "be" verbs, auxiliary 
verbs, pronouns, adverbs, and prefixes. 

When a user wants to visit a web page through the Result Display page, the visit 
is actually linked to the server's CGI program, which logs the visit history 
and redirects the request to the visited URL. The recorded information only indicates 
that the visited URLs are relevant. Therefore, the irrelevant information is filtered 
when the user completes a review session, that is, after the user has visited the relevant 
web pages and pressed the "Next Page" or "Back" button. 




Figure 2. The detailed architecture for supporting personalized search 



3.4 Implementation and Performance Evaluation 

The underlying system of Octopus runs on two platforms (a SPARC and a Windows 
NT machine). The ORB of this system is IONA's Orbix 2.02, which is a full implementation 
of the CORBA specification. 

Table 1. The overhead of Octopus with personalized search in two representative search 
engines (sec). Cells shown as -- are illegible in the source. 

# of returned items   --     --     --     --     --     --     --     --     --     -- 
AltaVista             --     --     --     --     --     --     --     --     --     -- 
A.V. Mediator         --     --     --     --     --     --     --     --     --     -- 
Overhead              --     0.57   --     --     0.75   0.8    1.32   1.37   1.88   2.63 
Yahoo                 --     --     --     --     --     --     --     --     --     -- 
Y. Mediator           --     --     --     --     --     --     --     --     --     -- 
Overhead              0.37   0.47   0.36   --     0.42   --     0.51   0.54   --     0.7 




A new user must register his/her personal information and must select the 
filtering level and cutoff threshold before issuing a search request. We performed 
some preliminary measurements to assess the overhead of the Octopus system with the 
personalized search function, as shown in Table 1. Two search engines, AltaVista and 
Yahoo, were used to assess the overhead of Octopus. The first row 
shows the number of items returned from the search engine. Although 
the overhead of the Internet is unpredictable in most situations, the results show that our 
system is efficient. The average total overhead is about 6.5%. We believe that the 
major overheads are filtering the returned references and the networking overhead of 
CORBA. Because these operations execute in parallel, the overhead will not increase 
linearly, i.e., it will not increase rapidly as the number of search engines increases. 



4. Discussions and Conclusions 

In this paper, we have described the policy for supporting personalized search based 
on ISS. We have also implemented these functions in Octopus. In order to keep the 
advantages of ISS, a personal information-filtering agent is added to Octopus 
instead of modifying the architecture or interface of ISS. In Octopus, an absolutely 
irrelevant filtering approach is used to support personalized search. 

Many advantages of using the ISS to build a multi-search engine have been raised 
in [3]. Other advantages were discovered in this design. First, a personal 
information-filtering agent is added to Octopus instead of modifying the 
architecture or interface of ISS when adding personal functionality. We believe this 
design is better suited to exploiting useful data in other applications. Second, because 
the interface of ISS is based on distributed object-oriented techniques and the 
modules of personalized search are implemented as replaceable components, it is easy 
to replace these components when a new and more suitable algorithm is proposed. 
Third, each user with a specific domain has an individual profile in Octopus, which may 
prevent Octopus from returning unsuitable results to users. Finally, because the filtering 
mechanism is divided into three levels, it can balance the system load and the user 
requirements. 

Abstract. Recently, a huge quantity of HTML documents has been 
created on the Internet, constituting a real treasury of information. 
HTML, however, is designed mainly for reading with browsers and is not 
suitable for machine processing, whereas XML was proposed as a solution 
to this problem. In this paper, we give a case-based transformation 
method from HTML documents to XML ones. There are many series of 
HTML pages in actual Web sites, and the pages of a series usually have 
quite similar structures. Therefore, case-based transformation 
should be a promising method in practice for semi-automatic 
transformation from HTML to XML. Through experimental evaluations, 
we show that this case-based method achieves highly accurate 
transformation, i.e., 85% of 80 actual pages can be transformed 
correctly. 

1 Introduction 

In order to utilize the tremendous amount of information on the Internet, machine 
processing of HTML documents has become quite important. HTML, 
however, is designed mainly for reading with browsers and is thus not suitable for 
machine processing. A wrapper is an information extraction technology for HTML, 
but it is difficult to develop and maintain wrappers automatically [8]. Recently 
XML was proposed as a solution to this problem [7, 13]. We address the 
problem by transforming HTML documents into XML ones. Unfortunately, 
fully automatic transformation from HTML to XML is also extremely difficult, 
because it requires understanding the meaning of HTML documents. 
On the other hand, there are indeed many series of HTML pages in actual Web 
sites, and the pages of a series usually have quite similar structures. 
Therefore, case-based transformation should be a promising method for such an 
HTML-page series in practice [9]. 

In this paper, we give a case-based transformation method from HTML 
documents to XML ones. The transformation method consists of two phases: a 
sample-analyzing phase and an XML-document-generating phase. Given a 
series of HTML documents and a sample transformation from one HTML document 
of the series into an XML document, the case-based method first analyzes 
both the syntactic and semantic features embedded in the sample 
transformation, and next automatically transforms the remaining HTML pages of the 
series into XML documents by using the information extracted from the 
sample. We adopt a vector model of term weighted frequency for approximating the 
meaning of HTML documents, and also use both headlines and a parsing tree 
as syntactic information. 

The rest of this paper is structured as follows: Section 2 explains the 
sample-analyzing phase. Section 3 shows the generating phase of XML documents. 
Section 4 discusses our method through experimental evaluations with actual 
HTML documents collected from the Internet, followed by some conclusions in 
Section 5. 

2 Analyzing a sample transformation 

HTML documents are transformed into XML ones with a sample 
transformation. A given sample HTML document is called a sample H-document and a 
given sample XML document is called a sample X-document. The other HTML 
documents among a given series, which will be transformed into XML, are called 
target H-documents. For example, Fig. 1 depicts a sample pair of HTML and 
XML documents of a transformation. Fig. 2 shows a target H-document. 

<HTML> 
<HEAD> 
<TITLE>Spa Guide</TITLE> 
</HEAD> 
<BODY> 
<H1>Shirahone spa</H1> 
<H2>Charge</H2> 
<P>500 yen</P> 
<H2>Business Hours</H2> 
<P>From 10:00 to 17:00</P> 
</BODY> 
</HTML> 

<!ELEMENT spa_guide (name, charge, business_hours)> 
<!ELEMENT name (#PCDATA)> 
<!ELEMENT charge (#PCDATA)> 
<!ELEMENT business_hours (#PCDATA)> 
<spa_guide> 
<name>Shirahone spa</name> 
<charge>500 yen</charge> 
<business_hours>From 10:00 to 17:00</business_hours> 
</spa_guide> 

Fig. 1. A sample pair: the first listing is a sample H-document and the second is a 
sample X-document. 

<HTML> 
<HEAD> 
<TITLE>Spa Guide</TITLE> 
</HEAD> 
<BODY> 
<H1>Fefukigawa spa</H1> 
<H2>Charge</H2> 
<P>600 yen</P> 
<H2>Business Hours</H2> 
<P>From 10:00 to 19:00</P> 
</BODY> 
</HTML> 



Fig. 2. A target H-document. 



In the sample-analyzing phase, we investigate some features of texts in a 
sample H-document and a relationship between a sample H-document and a 
transformed XML document. 



2.1 Analyzing features of texts in a sample H-document 

We divide the text part into several blocks. A text block is defined as a text 
part enclosed with a pair of HTML tags. In particular, text blocks in a sample 
H-document are called s-blocks. 

For example, the text blocks of the HTML document in Fig. 1 are "Spa Guide", 
"Shirahone spa", "Charge", and so on. 

Next, we analyze each s-block in two respects: syntactic structure and 
semantic information. 

Analyzing syntactic features. Here we consider the headlines embedded in 
a sample H-document and also use the parsing-tree information. A headline is 
defined as an s-block that is enclosed with <H> tags in the sample H-document. A 
headline of an s-block B is defined as the headline appearing immediately before 
B. A parsing-tree path of an s-block B is defined as the path from the root to B 
on the HTML parsing tree. 

For example, Fig. 3 is a parsing tree of the HTML document shown in Fig. 1. 
The s-block "500 yen" in Fig. 1 thus has the headline "Charge" and the 
parsing-tree path "HTML - BODY - P". 

Fig. 3. A parsing-tree of an HTML document in Fig. 1. 
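The two syntactic features can be collected in a single pass over the document. The sketch below uses Python's standard `html.parser` module and is only an illustration of the definitions above, not the authors' implementation; the class name `BlockParser` is hypothetical, and the headline of a block is taken to be the text of the most recent H1 to H6 block before it.

```python
from html.parser import HTMLParser

HEADLINE_TAGS = {"H1", "H2", "H3", "H4", "H5", "H6"}


class BlockParser(HTMLParser):
    """Collect (text, parsing-tree path, headline) for every text block."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, root first
        self.headline = None  # text of the most recent headline block
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag.upper())

    def handle_endtag(self, tag):
        name = tag.upper()
        if name in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != name:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        self.blocks.append((text, " - ".join(self.stack), self.headline))
        if self.stack[-1] in HEADLINE_TAGS:
            self.headline = text


# The sample H-document of Fig. 1.
SAMPLE = """<HTML><HEAD><TITLE>Spa Guide</TITLE></HEAD>
<BODY><H1>Shirahone spa</H1>
<H2>Charge</H2><P>500 yen</P>
<H2>Business Hours</H2><P>From 10:00 to 17:00</P>
</BODY></HTML>"""

parser = BlockParser()
parser.feed(SAMPLE)
parser.close()
s_blocks = parser.blocks
```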



Analyzing semantic features. We consider a term vector as the meaning 
of an s-block. A text block is represented by a term vector of the form 
$V_d = (w_1, w_2, \ldots, w_n)$, where each $w_i$ corresponds to the weight of the term $i$ [3, 
10]. We use WIDF (Weighted Inverse Document Frequency) [10, 11]. The 
WIDF weight of a term $t$ in an s-block $d$ is defined as follows: 

$$\mathrm{WIDF}(d, t) = \frac{\mathrm{TF}(d, t)}{\sum_{i} \mathrm{TF}(i, t)}, \tag{1}$$

where $\mathrm{TF}(d, t)$ is the number of occurrences of the term $t$ in the s-block $d$, and 
$i$ ranges over the s-blocks in the sample H-document. 
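The WIDF weight of Eq. (1) is straightforward to compute once the blocks are tokenized. The following minimal Python sketch assumes whitespace-style tokens supplied by the caller; how the original system tokenizes text is not stated, and `widf_vectors` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List


def widf_vectors(blocks: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: WIDF weight} dictionary per text block.

    WIDF(d, t) = TF(d, t) / sum of TF(i, t) over all blocks i.
    """
    tfs = [Counter(block) for block in blocks]
    totals: Counter = Counter()
    for tf in tfs:
        totals.update(tf)
    return [{term: count / totals[term] for term, count in tf.items()} for tf in tfs]


# Hypothetical tokenized s-blocks from the sample H-document.
vectors = widf_vectors([["spa", "guide"], ["shirahone", "spa"], ["charge"]])
```

A term that occurs in only one block gets weight 1, while a term shared by several blocks has its weight split among them.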



2.2 Analyzing a relationship between HTML and XML documents 
in a sample pair 

In a sample pair, each s-block in the sample H-document is embedded into the 
sample X-document and is enclosed with a pair of XML tags. We call such 
an enclosing XML tag an E-tag. To analyze the relationship between a sample 
H-document and an X-document, we make up a table consisting of the 
correspondence relations between E-tags and the enclosed s-blocks. 

3 Generating XML document 

In the second phase, we generate an XML document from a target H-document 
by using the analysis result of a sample pair. 

First, we divide a target H-document into several text blocks, which are 
called t-blocks, and analyze the features of each t-block with the same method as in 
analyzing the s-blocks of a sample H-document. 

Next, we generate a skeleton of an XML document according to the DTD 
data in the given sample X-document. Namely, we produce E-tags with null text 
data if the DTD indicates that the E-tags should have text content.¹ Otherwise 
we just produce XML tags, according to the definition in the DTD. 

For example, Fig. 4 is an XML skeleton generated from the DTD in the sample 
X-document in Fig. 1. 

<spa_guide> 
<name> </name> 
<charge> </charge> 
<business_hours> </business_hours> 
</spa_guide> 



Fig. 4. A generated skeleton of XML document according to the DTD in Fig. 1. 
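Skeleton generation from a DTD such as the one in Fig. 1 can be sketched in a few lines of Python. The sketch makes strong simplifying assumptions that are not in the original: it handles only content models that are either `(#PCDATA)` or a comma-separated sequence of child names, with no occurrence indicators or choices, and `parse_dtd` and `skeleton` are hypothetical names.

```python
import re
from typing import Dict, List, Optional

ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)\s*>")


def parse_dtd(dtd: str) -> Dict[str, Optional[List[str]]]:
    """Map each element name to its child names, or to None for (#PCDATA)."""
    model: Dict[str, Optional[List[str]]] = {}
    for name, content in ELEMENT_RE.findall(dtd):
        content = content.strip()
        if content == "#PCDATA":
            model[name] = None
        else:
            model[name] = [child.strip() for child in content.split(",")]
    return model


def skeleton(model: Dict[str, Optional[List[str]]], root: str, indent: int = 0) -> str:
    """Emit nested tags; text elements get a blank placeholder as content."""
    pad = "  " * indent
    children = model[root]
    if children is None:
        return f"{pad}<{root}> </{root}>\n"
    body = "".join(skeleton(model, child, indent + 1) for child in children)
    return f"{pad}<{root}>\n{body}{pad}</{root}>\n"


# The DTD of the sample X-document in Fig. 1.
DTD = """<!ELEMENT spa_guide (name, charge, business_hours)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT charge (#PCDATA)>
<!ELEMENT business_hours (#PCDATA)>"""

xml_skeleton = skeleton(parse_dtd(DTD), "spa_guide")
```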



Finally, we assign an optimal t-block to each null text column enclosed by E-tags 
in the generated XML skeleton. The optimal t-block for an E-tag is the t-block 
that is the most similar to the s-block enclosed with the E-tag in the given sample 
X-document. 

We consider a synthetic similarity measure between two text blocks, which 
consists of syntactic and semantic measures. 



3.1 Calculating similarity measure 



We formalize the semantic similarity between two text blocks, which is 
obtained by comparing their term vectors. We consider both the cosine of 
the angle between the two vectors and the ratio of their lengths. Thus, the semantic 
similarity between vectors $V_i$ and $V_j$, $\mathrm{Sim}(V_i, V_j)$, is defined as 

$$\mathrm{Sim}(V_i, V_j) = \frac{\sum_{k} (w_{ik} \times w_{jk})}{|V_i| \, |V_j|} \times \frac{\min(|V_i|, |V_j|)}{\max(|V_i|, |V_j|)}, \tag{2}$$

where $w_{jk}$ means the $k$-th element of the term vector $V_j$. The greater the value 
of $\mathrm{Sim}(V_i, V_j)$ is, the greater the similarity between the two text blocks is. Notice 
that $0 \le \mathrm{Sim}(V_i, V_j) \le 1$. 
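The semantic similarity, taken as the cosine of the angle multiplied by the ratio of the shorter to the longer vector length, can be sketched as follows. This is a minimal Python illustration that represents term vectors as sparse dictionaries; that representation and the name `sim` are assumptions made for the example.

```python
import math
from typing import Dict


def sim(vi: Dict[str, float], vj: Dict[str, float]) -> float:
    """Cosine of the angle times min(|Vi|, |Vj|) / max(|Vi|, |Vj|)."""
    norm_i = math.sqrt(sum(w * w for w in vi.values()))
    norm_j = math.sqrt(sum(w * w for w in vj.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # an empty block is treated as dissimilar to everything
    dot = sum(w * vj.get(term, 0.0) for term, w in vi.items())
    cosine = dot / (norm_i * norm_j)
    return cosine * (min(norm_i, norm_j) / max(norm_i, norm_j))


same = sim({"yen": 1.0}, {"yen": 1.0})
scaled = sim({"yen": 1.0}, {"yen": 2.0})
disjoint = sim({"yen": 1.0}, {"spa": 1.0})
```

The length ratio is what separates this measure from plain cosine similarity: two vectors pointing in the same direction but of different lengths, as in `scaled`, score below 1.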

¹ At this point, we cannot yet decide which t-block should be embedded in the
text-data column of an E-tag.

Two text blocks can be regarded as more similar if their corresponding
headlines and parsing-tree paths are identical. Thus, the synthetic
similarity between text blocks T_i and T_j, S(T_i, T_j), is defined as

\[
S(T_i, T_j) =
\begin{cases}
Sim(V_i, V_j) + \alpha + \beta & \text{if both the headlines and the parsing-tree paths are identical,} \\
Sim(V_i, V_j) + \alpha & \text{if only the headlines are identical,} \\
Sim(V_i, V_j) + \beta & \text{if only the parsing-tree paths are identical,} \\
Sim(V_i, V_j) & \text{otherwise,}
\end{cases}
\tag{3}
\]

where V_i means the term vector of the text block T_i, and α and β are
constants. Notice that 0 ≤ S(T_i, T_j) ≤ 1 + α + β.
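
The synthetic similarity and the choice of the optimal t-block can be sketched as follows. This is only an illustration: the values of `alpha` and `beta`, the candidate tuple layout, and the function names are assumptions, since the paper does not state the constants it used.

```python
def synthetic_similarity(sim, same_headline, same_path, alpha, beta):
    """Add the bonus alpha for identical headlines and the bonus beta
    for identical parsing-tree paths to the semantic similarity."""
    score = sim
    if same_headline:
        score += alpha
    if same_path:
        score += beta
    return score


def pick_optimal_t_block(candidates, alpha=0.2, beta=0.1):
    """candidates: list of (t_block_id, sim, same_headline, same_path).
    Returns the id of the t-block with the highest synthetic similarity
    to the s-block of one E-tag. alpha and beta are hypothetical values."""
    best = max(
        candidates,
        key=lambda c: synthetic_similarity(c[1], c[2], c[3], alpha, beta),
    )
    return best[0]
```

With these hypothetical constants, a t-block whose semantic similarity is slightly lower can still win if both its headline and its parsing-tree path match the sample s-block.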

For example, Fig. 5 shows an XML document generated from the HTML page
in Fig. 2, where appropriate t-blocks fill the null text columns in the XML
skeleton shown in Fig. 4.

<!ELEMENT spa_guide (name, charge, business_hours)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT charge (#PCDATA)>
<!ELEMENT business_hours (#PCDATA)>
<spa_guide>
  <name>Fefiikigawa spa</name>
  <charge>600 yen</charge>
  <business_hours>From 10:00 to 19:00</business_hours>
</spa_guide>

Fig. 5. A generated XML document from a target H-document in Fig. 2.



4 Evaluations 

This section describes some experiments to evaluate the performance of the
case-based transformation method from HTML to XML. We tested the proposed
case-based method with 80 actual HTML documents of 8 series. The accuracy
is evaluated by

\[
accuracy = \frac{a}{b} \times 100, \tag{4}
\]

where a is the number of text blocks that are enclosed by a pair of correct
E-tags, and b is the total number of text blocks in the generated XML document.



Results. Table 1 shows the results of the experiments, where Series is the name
of a series of HTML documents, dn is the number of transformed target
H-documents in a series, and tn is the average number of terms appearing in
each s-block.

Table 1. A result of evaluation.

Series               dn    tn     average of b   average of a   average of accuracy
[4] (in Japanese)     4    12.6    5.0            4.8             95
[14] (in Japanese)   12    26.0    3.0            3.0            100
[5] (in Japanese)     4    35.6    8.0            7.8             97
[15] (in Japanese)    8     6.0    5.0            4.5             90
[4] (in Japanese)     5    25.6   10.0            8.0             80
[12] (in Japanese)   11     9.6    6.0            4.7             79
[2] (in Japanese)     8    26.7    3.0            2.1             71
[1] (in English)     20     2      4              5.0             80
[6] (in English)      8    14.6    8.0            7.1             89
TOTAL                80                                           85

We achieved a highly accurate transformation with this case-based method: the
average accuracy over the 80 actual pages is 85%.

5 Conclusions

We proposed a new case-based transformation method from HTML documents
to XML ones. We used 80 actual HTML documents of 8 series for the
experimental evaluation, which showed that the proposed method achieves high
accuracy. The case-based transformation should simplify the task of Internet
information extraction, and is quite valuable for practical applications.

As future research, we are planning to extend the case-based transformation
method in order to improve the accuracy. On the other hand, it is
unfortunately impossible to avoid transformation errors completely within the
case-based method. Therefore we are planning to develop an interactive editor
to support the transformation.


Abstract. A generic architecture for the storage and retrieval of XML
documents in relational databases is proposed. Documents are stored
as node trees to facilitate retrieval and manipulation using an interface
that conforms to the Document Object Model (DOM) specification of
the World Wide Web Consortium (W3C). This approach offers many
benefits, including the ability to leverage DOM programming techniques
for the manipulation of XML documents that would otherwise be
prohibitively large for persistence in memory.



1 Introduction 

The Extensible Markup Language (XML) [1], a data formatting recommenda- 
tion proposed by the W3C as a simplified form of the Standard Generalized 
Markup Language (SGML), is becoming the de facto standard for the platform- 
independent representation of information and its transfer between and within 
web-enabled applications. 

XML is a meta-language that facilitates the creation and formatting of
domain- or application-specific conceptual models and document markup
languages in which the language elements (consisting of a start tag and an end
tag, <foo> ... </foo>, or an empty tag, <foo/>) can be defined by Document
Type Definitions (DTDs) or XML Schema Definitions (XSD) [2]. Alternatively,
XML documents may be entirely self-describing.

XML documents that conform to the rules of XML mark-up are called
"well-formed"; for example, each document must have a single top-level (root)
element, and all tags must be correctly nested (for an example document see
figure 1). A number of additional instructions are permitted, such as comments,
processing instructions, unparsed character data and entity references. Tags
can also contain attributes in the form of name and value pairs, with the
values enclosed in quotation marks.

One essential aspect of XML is that it allows semantically-rich content to
be abstracted away from the presentation layer afforded by languages such as
Hyper Text Markup Language (HTML). Companion recommendations such as
the Extensible Stylesheet Language (XSL) [4] allow information from XML files
to be rendered into other text formats, including HTML for display on browser
platforms, and other formats such as Rich Text Format (RTF) and LaTeX.

1.1 Node trees

Any well-formed XML document can be portrayed as a hierarchical node tree
in which the nodes represent tag elements, text, or supporting information such
as processing instructions and comments. Figure 1 shows an example XML
document and figure 2 shows the corresponding node tree.

<wml>
  <card id="cFirst" title="First card" newcontext="true">
    <p align="center">
      <img src="/images/logo.wbmp" alt="Logo" align="middle"/>
      <br/> Welcome!
    </p>
    <p> <a href="#cSecond">Next card</a> </p>
  </card>
  <card id="cSecond" title="Second card">
    <p align="center"> Some content ... </p>
  </card>
</wml>

Fig. 1. Sample XML

The labelling of each node with two numerical coordinates (designated x and y)
by walking around the node tree from the root node, downwards and from left to
right, provides simple algebraic methods of navigation, for example:

1. The next sibling (x', y') of any given node (x, y) has x' = y + 1.

2. The first child (x', y') of any given node (x, y) has x' = x + 1.

3. The set of nodes that originate from any given node (x, y) have x < x' < y - 1.
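
The three navigation rules can be checked with a short Python sketch. It assumes one particular reading of "walking around the node tree": a single counter that is incremented once when a node is entered (giving x) and once when it is left (giving y). The `Node` class and helper names are hypothetical.

```python
class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.x = None  # assigned when the walk enters the node
        self.y = None  # assigned when the walk leaves the node


def number_tree(root):
    """Assign (x, y) to every node by walking around the tree,
    downwards and from left to right, with one shared counter."""
    counter = 0

    def walk(node):
        nonlocal counter
        counter += 1
        node.x = counter
        for child in node.children:
            walk(child)
        counter += 1
        node.y = counter

    walk(root)
    return root


def descendants(node):
    """All nodes that originate from the given node."""
    found = []
    for child in node.children:
        found.append(child)
        found.extend(descendants(child))
    return found
```

Under this assumed numbering, testing whether one node lies beneath another needs only a comparison of indices, which is what makes the scheme attractive for relational storage.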

1.2 The Document Object Model 

The Document Object Model (DOM) proposed by the W3C [3] provides an
application programming interface (API) for XML and HTML documents. The
model defines a logical structure for such documents, and the API facilitates
document access and manipulation. A DOM object internalises an XML (or
HTML) document as a node tree, and exposes methods for the creation, retrieval
and manipulation of its nodes. On completion of processing, the DOM object
can be serialised as XML.
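
As an illustration of this cycle (internalise, manipulate, serialise), the following sketch uses the DOM implementation in Python's standard library, `xml.dom.minidom`. It is an in-memory example only; the sample markup and the function name `add_paragraph` are assumptions.

```python
from xml.dom.minidom import parseString


def add_paragraph(xml_text, card_id, content):
    """Parse a document into a DOM node tree, append a <p> element to
    the card with the given id, and serialise the tree back to XML."""
    doc = parseString(xml_text)
    for card in doc.getElementsByTagName("card"):
        if card.getAttribute("id") == card_id:
            paragraph = doc.createElement("p")
            paragraph.appendChild(doc.createTextNode(content))
            card.appendChild(paragraph)
    return doc.documentElement.toxml()
```

The whole document is held in memory while it is being manipulated, which is the limitation that motivates the repository described later.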



1.3 Parsing XML 

There are two XML parsing approaches in common usage: one involves an
event-based model and the other a compilation model. In the compilation method
the XML document is encapsulated as a DOM object, providing rich functionality
through the methods of the DOM API, but with limitations commensurate with
the amount of available memory and the CPU-intensive nature of this approach.
In practice, documents larger than a couple of megabytes can become unwieldy
and impractical. This is a significant problem for real-world applications
handling non-trivial data files.

Fig. 2. Sample XML document represented as a node tree

The event-based model, the Simple API for XML (SAX), offers an alternative 
that does not have the same limitation regarding large documents, but may 
provide less functionality for the application developer. SAX uses a serial-access 
mechanism with element-by-element processing. The XML document is scanned 
in a sequential fashion and key events (such as an element being encountered) 
trigger callback methods. Serial access allows for any size of file, but does not 
provide the same immediacy of contextual information as DOM encapsulation 
affords. 
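
A minimal sketch of the event-based model, using the SAX support in Python's standard library (`xml.sax`): the handler sees one element at a time and must keep any context (here, the current depth) itself. The class and function names are hypothetical.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Counts element names and tracks nesting depth from SAX events."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.depth = 0
        self.max_depth = 0

    def startElement(self, name, attrs):
        # Callback triggered when an opening (or empty) tag is encountered.
        self.counts[name] = self.counts.get(name, 0) + 1
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def endElement(self, name):
        # Callback triggered when the element is closed.
        self.depth -= 1


def count_tags(xml_text):
    handler = TagCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.counts, handler.max_depth
```

Because only the counters are retained, memory use does not grow with document size, but information such as a node's siblings is not available unless the handler records it.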

Both parsing approaches can optionally include validation of the supplied
XML against a DTD or XSD. At the very least, they must ensure that the
supplied document is well-formed.

2 XML and Databases 

XML documents are generally considered as belonging to one of two categories:
data-centric and document-centric [5,6]. Typically, data-centric XML documents
contain structured information (of varying types) extracted from databases and
other conventional data sources. In contrast, document-centric XML documents
use XML markup to add semantics to irregularly structured text-based
information. These distinctions are not rigid, and XML documents that display
the characteristics of both are increasingly common.

A number of technologies for storing XML documents are currently available. 
Many database vendors offer some degree of XML support in the latest releases 
of their products, although these solutions are not yet mature, and either store 
XML documents as single entities, or require the user to map their schemas 
to specific database tables. “Native XML” databases (which can store any well- 
formed XML document without recourse to restructuring them to suit a different 
underlying storage model) are in the early stages of commercial release, although 
at the time of writing, these systems are restricted in terms of the hardware 
platforms they support and may also be prohibitively expensive. 

A suitable compromise for many organisations will be to extend their
current storage solutions to accommodate XML, in order to leverage the
stability and scalability of their existing platforms and take advantage of
their in-house expertise. Relational databases are a particular case in point,
not least because organisations may wish to link information from relational
sources into XML documents.

A number of studies have proposed strategies for storing XML data in rela- 
tional databases; see, for example, [7,8]. Florescu and Kossmann measured query 
performance for a variety of mapping schemes, in which XML data is modelled 
as edge-labelled graphs; for the purposes of their study it was not necessary 
to maintain the distinction between attributes and sub-elements. Their models 
were designed to maximise query performance, with less emphasis on facilitating 
efficient updates and serialisation of whole documents.

Our proposed architecture can accommodate data from any well-formed XML 
document, even if schema information is not available, and maintains the dis- 
tinction between attributes and sub-elements. The architecture facilitates pro- 
grammatic traversal and manipulation of XML node trees through an API that 
exposes methods which conform to the DOM recommendation of the W3C. 



2.1 Persistent DOM (pDOM) 

This section outlines our XML repository architecture. Document loading is
achieved in two stages using a SAX parser. The first pass ensures the XML is
well-formed prior to any database activity. The second pass involves the
following activities:

1. A logical database transaction is commenced. The document loading process
is undertaken as a single transaction to ensure that it succeeds or fails as a
single unit.

2. Stored procedures in the database are invoked to insert data into the
appropriate tables.

3. Upon successful completion, the transaction is committed.

The first action of the parser-instantiated database transaction is an insert to
the doc table. The document is given an arbitrary numerical identifier doc_id,
which is stored along with the URI of the source document and some version
control information.

Node identifiers (an arbitrary node_id plus the x and y indices), the node
type, the depth of the node in the tree, and a pointer to the x value of the
node's nearest neighbours are stored in the node table. Element names and text
are stored in node_leaf (namespaces are stored separately in
node_namespace_leaf). Attribute information, namespace, name and value are
stored in the four tables with names prefixed with attribute_.

In the tables suffixed with _leaf, the leaf_text field has a varchar(255)
datatype. Although many RDBMS platforms support longer varchar fields, even
free text objects, the 255 byte character field represents the "lowest common
denominator". Since XML element and attribute names, namespaces, values and
text can exceed 255 characters, there is a paging mechanism; data is broken up
into a linked list of 255 character pages (each identified by a sequential
leaf_id). A full listing of tables and columns is provided in table 1.



Table 1. XML repository tables; primary key columns are indicated by *.

Static tables (column list...)

node_type (node_type_id*, description, left_delimiter, right_delimiter)

Dynamic tables (column list...)

doc (doc_id*, root_node_id, source, date_loaded, contributor_id)
node (node_id*, x_index, y_index, node_type, owner_doc_id, depth,
      parent_node_id, prev_sibling_node_id, next_sibling_node_id,
      first_child_node_id, node_size)
element_namespace_leaf (node_id*, leaf_id*, leaf_text)
element_name_leaf (node_id*, leaf_id*, leaf_text)
attribute (node_id*, attribute_id*)
attribute_namespace_leaf (node_id*, attribute_id*, leaf_id*, leaf_text)
attribute_name_leaf (node_id*, attribute_id*, leaf_id*, leaf_text)
attribute_value_leaf (node_id*, attribute_id*, leaf_id*, leaf_text)
text_leaf (node_id*, leaf_id*, leaf_text)
comment_leaf (node_id*, leaf_id*, leaf_text)
entity_reference_leaf (node_id*, leaf_id*, leaf_text)
pi_data_leaf (node_id*, leaf_id*, leaf_text)
pi_target_leaf (node_id*, leaf_id*, leaf_text)
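
The paging mechanism and the single-transaction load can be sketched with an in-memory SQLite database. This is an assumption-laden illustration: SQLite has no stored procedures, so plain INSERT statements stand in for them, only the text_leaf table is created, and the function names are hypothetical.

```python
import sqlite3

PAGE_SIZE = 255  # width of the leaf_text column


def make_repository():
    """Create an in-memory database holding only the text_leaf table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE text_leaf ("
        "node_id INTEGER, leaf_id INTEGER, leaf_text VARCHAR(255), "
        "PRIMARY KEY (node_id, leaf_id))"
    )
    return conn


def store_text(conn, node_id, text):
    """Break text into 255-character pages with sequential leaf_id
    values and insert them as one transaction."""
    pages = [text[i:i + PAGE_SIZE] for i in range(0, len(text), PAGE_SIZE)] or [""]
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO text_leaf (node_id, leaf_id, leaf_text) VALUES (?, ?, ?)",
            [(node_id, leaf_id, page) for leaf_id, page in enumerate(pages, start=1)],
        )


def load_text(conn, node_id):
    """Reassemble the pages of a node in leaf_id order."""
    rows = conn.execute(
        "SELECT leaf_text FROM text_leaf WHERE node_id = ? ORDER BY leaf_id",
        (node_id,),
    )
    return "".join(row[0] for row in rows)
```

A 600-character text node is stored as three rows (255, 255 and 90 characters) and read back unchanged by ordering on leaf_id.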



To facilitate programmatic access to documents stored in the repository (or
components thereof), a pDOM Java class has been developed. A pDOM object
connects to the repository when it is instantiated, and disconnects when it is
destroyed. The pDOM API provides DOM-compliant methods; the methods
address nodes in the repository database, whereas DOM parser methods typically
address nodes in memory. The pDOM API will allow system developers to scale
up DOM-based solutions simply by loading documents into the repository and
instantiating pDOM objects instead.



2.2 Current and further work 

Current work is focussed on the following issues:

1. Performance enhancements, including improved text storage capabilities, 
and strategies for priming the RDBMS cache with nearest-neighbour nodes. 

2. Methods for handling concurrent document manipulation. 

3. Standards-based querying capabilities. 

4. Server-side code using XSL transformations (XSLT) to convert XML data 
from the repository into other formats. 

5. Validation of documents with DTDs and XSDs stored in the same (or
companion) repositories.


Abstract. Captions graphically superimposed in news video frames can
provide important indexing information. The automatic extraction and
recognition of news captions can be of great help in querying topics
of interest in a digital news video library. To develop such a system
for Chinese news video, we present algorithms for detection, extraction,
binarization and recognition of Chinese video captions. Experimental
results show that our caption processing scheme is effective and robust
and significantly improves video caption OCR results.



1 Introduction 

The ongoing proliferation of digital image and video databases has led to an in- 
creasing demand for systems that can query and search large video databases effi- 
ciently and accurately. Manual annotation of video is extremely time-consuming, 
expensive, and unscalable [1] . Therefore, automatic extraction of video descrip- 
tions is desirable in order to annotate and search large video databases. Text 
present in video frames is a valuable description of video content information. 
Automatic extraction and recognition of video text can provide an efficient ap- 
proach to systematically label and annotate video content. 

The automatic extraction of caption text in video frames has attracted
much attention in content-based information retrieval. Some practical systems
have been constructed for VOD [2-4]. However, current research focuses almost
exclusively on extraction and recognition of English text. Little work can
be found on Chinese characters, which have unique structures compared to
English characters. In this paper, we use Chinese news video as a test-bed to
address the problem of Chinese caption extraction and recognition.

Compared with OCR from document images, caption extraction and
recognition in video presents several new challenges [3]. First, the caption in
a video frame is often embedded in a complex background, making caption
extraction and separation difficult. The second problem is the low resolution
of the characters, since most video caption characters are made fairly small to
avoid occluding interesting objects in the frame. Lastly, the low resolution
character image is further degraded by the lossy compression schemes typically
used for video compression.

With these problems in mind, we have designed a system for extraction and
recognition of Chinese captions in news video programs. The system consists
of four modules: caption detection, caption extraction, character binarization,
and Chinese OCR. We use Hong Kong news programs to evaluate the
performance of our system.

K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 425-430, 2000.

Fig. 1. The partition of a frame.

Fig. 2. The distribution of blocks.



2 Caption Detection 



According to a priori knowledge of news video, the captions are always
positioned in the center of the screen, so we need only look at the central
section of the video frame to detect whether a caption line appears. For the
news video of the Hong Kong TVB station, each frame is digitized to 288 x 352
pixels and the size of a character ranges from 14 x 14 to 20 x 20 pixels. To
detect whether a caption line exists in the central part of the frame, we first
divide the central section of the frame into 24 blocks as shown in Fig. 1. Let
the central part of each frame of a given video sequence be denoted by $F^t$,
t = 1, 2, ..., T, and let each block be represented as $B_i^t$,
i = 1, 2, ..., 24. The task is then to find the blocks that contain caption
text.

Chinese characters primarily consist of four types of basic strokes, i.e.,
horizontal, vertical, up-right-slanting and up-left-slanting strokes [4].
These stroke segments contain rich high frequency energy in the four directions.
To extract the directional frequency information in each block, we use a
single-level wavelet transform to decompose the image segment into four
components. They provide an approximation to the original image block and the
details in the horizontal, vertical and diagonal directions. We use the Haar
wavelets since they are computationally efficient and are suitable for local
detection of line segments [2].

For each smoothed block $B_i^t$, i = 1, 2, ..., n, of a given video frame, the
single-level Haar wavelet decomposes the image into four subbands,

\[
W_h : B_i^t \rightarrow
\begin{pmatrix}
A_i^t(u,v) & H_i^t(u,v) \\
V_i^t(u,v) & D_i^t(u,v)
\end{pmatrix}
\tag{1}
\]

where $A_i^t(u,v)$ is the approximation of the block image $B_i^t(x,y)$, and
$H_i^t$, $V_i^t$ and $D_i^t$ are the details of the image in the horizontal,
vertical and diagonal directions respectively. For each M x N subband image, we
calculate the average energy as the feature value,

\[
E_i^t(d) = \frac{1}{MN} \sum_{u} \sum_{v} \left[ d_i^t(u,v) \right]^2,
\quad d \in \{H, V, D\}. \tag{2}
\]

In addition, we also compute the conventional edge feature $E_i^t(B)$ directly
from the binary edge map of the original block $B_i^t$. $E_i^t(B)$ is calculated
by counting the edge pixels in the edge map.

Since the caption regions contain line strokes in all four directions, they
have large values for all the features simultaneously. For the background
blocks, in contrast, only a subset of the features has relatively large values,
indicating certain edge-like structures in a particular direction. To obtain a
more stable and distinguishable feature, we integrate the four features into a
single feature by multiplying them,

\[
\eta_i^t = \prod_{d \in \{H, V, D, B\}} E_i^t(d),
\quad i = 1, 2, \ldots, n, \quad t = 1, 2, \ldots, T. \tag{3}
\]
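
The feature computation can be sketched with NumPy as follows. Several details are assumptions rather than the authors' choices: the Haar normalisation, the use of the mean squared coefficient as "average energy", and an edge map obtained by thresholding the gradient magnitude (the `threshold` value is hypothetical).

```python
import numpy as np


def haar_subbands(block):
    """Single-level 2-D Haar decomposition of a block with even sides.
    Returns the approximation and the horizontal, vertical and
    diagonal detail subbands."""
    b = np.asarray(block, dtype=float)
    b = b[: b.shape[0] // 2 * 2, : b.shape[1] // 2 * 2]
    p00, p01 = b[0::2, 0::2], b[0::2, 1::2]
    p10, p11 = b[1::2, 0::2], b[1::2, 1::2]
    a = (p00 + p01 + p10 + p11) / 2.0
    h = (p00 + p01 - p10 - p11) / 2.0  # row differences: horizontal strokes
    v = (p00 - p01 + p10 - p11) / 2.0  # column differences: vertical strokes
    d = (p00 - p01 - p10 + p11) / 2.0  # diagonal detail
    return a, h, v, d


def average_energy(subband):
    """Mean squared coefficient of a subband (assumed energy measure)."""
    return float(np.mean(subband ** 2))


def edge_count(block, threshold=32.0):
    """Number of pixels whose gradient magnitude exceeds the threshold."""
    gy, gx = np.gradient(np.asarray(block, dtype=float))
    return int(np.count_nonzero(np.hypot(gx, gy) > threshold))


def combined_feature(block, threshold=32.0):
    """Product of the three directional energies and the edge count."""
    _, h, v, d = haar_subbands(block)
    return (average_energy(h) * average_energy(v) * average_energy(d)
            * edge_count(block, threshold))
```

A block of purely horizontal stripes has a large horizontal energy but zero vertical and diagonal energy, so its product is zero; this is the behaviour the multiplication is meant to achieve for background structures with a single dominant direction.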

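After selecting the features, we use a simple second-order statistical
classifier to identify the caption blocks. Let the caption and non-caption
categories be denoted by $\Phi_c$ and $\Phi_n$ respectively. A given block
$B_i^t$ with feature value $\eta_i^t$ is classified as $\Phi_c$ or $\Phi_n$
according to the following rule,

\[
B_i^t \in
\begin{cases}
\Phi_c & \text{if } \|\eta_i^t - \mu_c\| / \sigma_c < \|\eta_i^t - \mu_n\| / \sigma_n, \\
\Phi_n & \text{if } \|\eta_i^t - \mu_c\| / \sigma_c \ge \|\eta_i^t - \mu_n\| / \sigma_n,
\end{cases}
\tag{4}
\]

where $\mu_c$ and $\mu_n$ are the means and $\sigma_c$ and $\sigma_n$ are the
variances of the two categories, and $\| \cdot \|$ is a certain norm.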
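
A minimal sketch of this decision rule for a scalar feature, taking the absolute value as the norm (an assumption, since the paper leaves the norm unspecified). The function names are hypothetical, and the class statistics are estimated with the standard library.

```python
from statistics import mean, pvariance


def fit_class(values):
    """Estimate the mean and variance of one category from training features."""
    return mean(values), pvariance(values)


def classify_block(eta, mu_c, sigma_c, mu_n, sigma_n):
    """Assign the block to the category with the smaller normalised distance."""
    distance_c = abs(eta - mu_c) / sigma_c
    distance_n = abs(eta - mu_n) / sigma_n
    return "caption" if distance_c < distance_n else "non-caption"
```

Dividing by the spread of each category means that a feature value halfway between the two means is assigned to the category whose training values vary more.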

Since we do not assume that the exact location and height of the caption
line are known, the block height is chosen to be small enough that at least one
block is fully occupied by captions. Some blocks straddle caption lines and
non-caption space. We put these blocks in a new category, the semi-caption
category $\Phi_s$. Therefore we have three categories. With 7200 training
blocks, we obtain the sample distribution of the three categories shown in
Fig. 2. The blocks labeled as semi-caption and caption form the potential
caption regions.

3 Caption Extraction 

After the caption and semi-caption blocks are identified, they are merged into 
caption regions. The next step is to locate and separate caption lines and indi- 
vidual characters. 

Fig. 3. The located characters.

Fig. 4. An example of character binarization, panels (a)-(e).

In traditional document OCR, the text lines are separated by horizontal
projections. However, this technique is not suitable for video captions, since a
video frame usually contains a very complex background. To reduce the influence
of the complex background, we use a set of mathematical morphology operations
to detect the contour of the characters. Assuming that a binarized caption image
is denoted by $R_c$, the contour map of $R_c$ is extracted by

\[
e(R_c) = R_c - R_c \circ S, \qquad R_c \circ S = (R_c \ominus S) \oplus S, \tag{5}
\]

where S is an isotropic structuring element and $R_c \circ S$ is the opening of
$R_c$ by S, which is defined by an erosion operator followed by a dilation
operator. A smoothing filter is first applied to the contour map to eliminate
extraneous fragments and to connect broken character segments. Then a simple
threshold can locate caption boundaries in the horizontal projection of the
smoothed image.
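
The contour extraction by opening can be sketched with NumPy. The 3 x 3 square structuring element and the treatment of pixels outside the image as background are assumptions; the paper only says that S is isotropic.

```python
import numpy as np


def _neighbourhoods(img):
    """The nine shifted copies of img covered by a 3 x 3 square,
    with pixels outside the image treated as background."""
    padded = np.pad(img, 1, mode="constant", constant_values=False)
    rows, cols = img.shape
    return [padded[r:r + rows, c:c + cols] for r in range(3) for c in range(3)]


def erode(img):
    return np.logical_and.reduce(_neighbourhoods(img))


def dilate(img):
    return np.logical_or.reduce(_neighbourhoods(img))


def opening(img):
    """Erosion followed by dilation."""
    return dilate(erode(img))


def contour_map(img):
    """Pixels of the binary image that are removed by the opening."""
    img = np.asarray(img, dtype=bool)
    return img & ~opening(img)
```

Structures thinner than the structuring element, such as character strokes, are removed by the opening and therefore survive in the difference, while large uniform regions cancel out.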

For the obtained caption line, a vertical projection can be used to locate
the individual characters. However, it is difficult to select a proper threshold
to locate all the characters. A high threshold will lead to character loss,
while a low threshold will lead to character merging. Our strategy is first to
guarantee a low character loss rate, then to re-segment the merged characters
with heuristic rules. Fig. 3 shows the located characters marked with boxes. We
see that, except for one false character, all the characters are located
exactly. The false character will be excluded in the recognition step.
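
The effect of the threshold on a vertical projection can be seen in a small sketch. The function name and the toy image are hypothetical, and the heuristic re-segmentation rules mentioned above are not modelled.

```python
def segment_by_projection(binary_rows, threshold=0):
    """Sum each column of a binary image and return the (start, end)
    column ranges whose projection value exceeds the threshold."""
    width = len(binary_rows[0])
    profile = [sum(row[c] for row in binary_rows) for c in range(width)]
    segments, start = [], None
    for c, value in enumerate(profile):
        if value > threshold and start is None:
            start = c
        elif value <= threshold and start is not None:
            segments.append((start, c - 1))
            start = None
    if start is not None:
        segments.append((start, width - 1))
    return segments
```

In the toy image used for checking this sketch, two characters are joined by a single pixel: a threshold of 0 merges them into one segment, while a threshold of 1 separates them.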



4 Character Binarization 

Although we have obtained the individual characters, they cannot be fed into
an OCR classifier directly, since the extremely low resolution is insufficient
for recognition and the character is still blended with a complex background.
Before separating the character from the background, we first increase the
resolution of the character image by a factor of four through interpolation.
Even though this does not add new information to the gray scale image, it does
help to smooth the characters in the binary image.

We show the character binarization steps through an example character
shown in Fig. 4. Fig. 4(a) gives a typical character on a non-uniform background.
Using a spline interpolation function, we obtain the image with higher resolution
in (b). We then binarize the interpolated image with a fixed threshold to get the
binary image in (c). Apparently, some background regions still remain in the
binary image. Fortunately, a bright character on a bright background always has
a black profile around it to help the audience to read the character. Based on
this observation, a region connectivity analysis scheme is proposed to eliminate
the remaining background residues.

Because of background noise, the character is often connected to the background
residues by small bridges even though there is a black profile around the
character. To break the bridges between the character and the background, we
adopt a set of morphological processing techniques, the opening operation and
the H-break operation. The opening eliminates small noise particles while
preserving the global shape of the objects. The H-break operation mainly
eliminates the H-connected pixels. Both of them are effective in cutting off the
connection between the character and the background.

After the morphological operations, we label all the connected components
in the binary image. We then remove any connected component that is too small
in size. If a connected component touches the periphery of the character image,
it is also declared as background and filtered out. Through these processes, we
finally obtain the binary character with a clear background shown in Fig. 4(e).
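
The connected-component filtering can be sketched in plain Python. The use of 4-connectivity and the `min_size` value are assumptions; the paper does not state either.

```python
from collections import deque


def clean_character(image, min_size=3):
    """Keep only connected components that are at least min_size pixels
    large and do not touch the periphery of the character image."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    result = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill collects one component.
            component, touches_border = [], False
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                if y in (0, rows - 1) or x in (0, cols - 1):
                    touches_border = True
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) >= min_size and not touches_border:
                for y, x in component:
                    result[y][x] = 1
    return result
```

The border test relies on the black profile around the character: once the bridges are cut, anything still reaching the edge of the image cannot belong to the character itself.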

Since there exist several successful commercial OCR packages, we do not 
intend to implement a new OCR system. After a clear binary character is ex- 
tracted from the video frame, we use the OCR classifier, TH-OCR LV, for final 
character recognition. Our purpose is to test whether our character extraction 
methods can efficiently obtain binary characters clear enough for regular printed 
optical character recognition. 

5 Experimental Results 

We evaluate the system using the TVB news programs aired by the Hong Kong
Jade station. Video data is encoded in MPEG-1 format at 288 x 352 resolution.
Three 30-minute programs are used in the experiments.

Evaluation of caption detection: Through the training process, we first obtain
the parameters of the classifier, then classify 1272 test image blocks. The
resulting confusion matrix for caption detection is shown in Table 1. The
results are very good in terms of separating non-caption from caption and
semi-caption parts, which is the main purpose of this classification.

Table 1. The confusion matrix of caption detection.

Table 2. The OCR results with and without preprocessing.

Candidates   With preprocessing   Without preprocessing
1            86.11%               15.27%
2            90.27%               18.75%
3            92.36%               20.83%

Evaluation of caption extraction: The second data set is used to evaluate the
caption extraction, which includes caption line separation and individual
character separation. We obtain a precision of 93% for all the extracted
characters. Although some false characters are extracted, they can be
further excluded by the following OCR procedure. Fig. 5(a) presents an example
of the characters extracted from news video. Looking at the 4 false characters,
we find that they often have features similar to those of real characters.

Fig. 5. An example of extraction, binarization and recognition of video characters.



Evaluation of character binarization and OCR: Using our algorithms, we
binarize each extracted individual character from the complex background. An
example is shown in Fig. 5(b). Although several small strokes in some
characters are lost in the process, most characters are correctly binarized.
Finally, we obtain an average recognition rate of 15% for characters without
any preprocessing and 86% for the binary characters extracted using our
algorithms. Fig. 5(c) shows an example of a recognition result. Table 2 compares
the correct rates of the first one, two and three candidates with and without
our preprocessing. The results show that the preprocessing improves the correct
recognition rate significantly.



6 Conclusion 

We have presented an effective Chinese caption processing system. The overall
OCR rate is improved significantly through our processing scheme. With this
system, one can realize the automatic annotation of news video and provide an
indexing text file for a news retrieval system. Accurate video OCR is valuable
not only for conventional video libraries, but also for other new types of video
content understanding applications, such as matching faces to names and
identifying advertisements.


Abstract. In this paper, we present advanced algorithms to reduce the
computation cost of block matching algorithms for motion estimation in video
coding. Advanced Multilevel Successive Elimination Algorithms are based on the
Multilevel Successive Elimination Algorithm for Block Matching Motion
Estimation [1] and the Successive Elimination Algorithm [2]. Advanced Multilevel
Successive Elimination Algorithms consist of three algorithms. The second
algorithm is useful not only for the Multilevel Successive Elimination Algorithm
but also for all kinds of block matching algorithms. The efficiency of the
proposed algorithms was verified by experimental results.



1 Introduction 

There is considerable temporal redundancy in consecutive video frames. Motion
estimation and compensation techniques have been widely used in image sequence
coding schemes to remove temporal redundancy. The accuracy and efficiency of
motion estimation affect the efficiency of temporal redundancy removal.

Motion estimation methods are classified into two classes: block matching
algorithms (BMA) [3] and pel-recursive algorithms (PRA) [4]. Due to their
implementation simplicity, block matching algorithms have been widely adopted by
various video coding standards such as CCITT H.261, ITU-T H.263, and MPEG. In
BMA, the current image frame is partitioned into fixed-size rectangular blocks.
The motion vector for each block is estimated by finding the best matching block
of pixels within the search window in the previous frame according to matching
criteria.

Although the Full Search (FS) algorithm finds the optimal motion vector by searching 
exhaustively for the best matching block within the search window, its high computation 
cost limits its practical applications. To reduce the computation cost of FS, many fast block 
matching algorithms have been developed, such as three-step search, 2-D log search, 
orthogonal search, cross search, one-dimensional full search, variations of three-step 
search, unrestricted center-biased diamond search, and circular zonal search. As described in 
[5], these algorithms rely on the assumption that the motion-compensated residual error 
surface is a convex function of the displacement motion vectors, but this assumption is 
rarely true [6]. Therefore, the best match obtained by these fast algorithms is generally only a 
local optimum. In other words, most fast algorithms reduce computation cost at the 
expense of the accuracy of the motion estimation. 

Without this convexity assumption, the Successive Elimination Algorithm (SEA) 
proposed by Li and Salari [2] reduces the computation cost of the FS. To reduce the 
computation cost of SEA, Gao et al. proposed the Multilevel Successive Elimination 
Algorithm (MSEA) [1]. This paper presents Advanced Multilevel Successive 
Elimination Algorithms (AMSEA). The motion estimation accuracy of AMSEA is 
identical to that of FS, and AMSEA reduces the computation cost of MSEA. 



2 Multilevel Successive Elimination Algorithm 

Let $f_c(i,j)$ and $f_p(i,j)$ denote the intensity of the pixel with coordinate $(i,j)$ in the current 
frame and the previous frame respectively. Assume that the size of a block (Y 
component of the macro block in H.263) is $N \times N$ pixels, the search window size is 
$(2M+1) \times (2M+1)$ pixels, and the matching criterion is the Sum of Absolute 
Differences (SAD), which is the distortion between two blocks. Let $B_c^{(i,j)}$ and $B_p^{(i-x,j-y)}$ 
denote the target block in the current frame and the compared block in the previous frame, 
with top left corners at $(i,j)$ and $(i-x, j-y)$ respectively. $B_c^{(i,j)}$ is the target block that 
requires a motion vector. 

$$B_c^{(i,j)}(m,n) = f_c(i+m,\, j+n) \qquad (1)$$

$$B_p^{(i-x,j-y)}(m,n) = f_p(i-x+m,\, j-y+n) \qquad (2)$$

where $x$ and $y$ represent the two components of a candidate motion vector, $-M \le x, y \le M$ 
and $0 \le m, n \le N-1$. The SAD between the two blocks is defined as: 

$$\mathrm{SAD}(x,y) = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} \left| B_c^{(i,j)}(m,n) - B_p^{(i-x,j-y)}(m,n) \right| \qquad (3)$$

The goal of motion estimation is to find the pair of indices $(x,y)$ that 
minimizes the sum of absolute differences: 

$$d = \min_{x,y} \mathrm{SAD}(x,y) \qquad (4)$$

Applying the inequality $\big|\, \|X\|_1 - \|Y\|_1 \big| \le \|X - Y\|_1$ [7] with $X = B_c^{(i,j)}$ and 
$Y = B_p^{(i-x,j-y)}$ gives 

$$|R - M(x,y)| \le \mathrm{SAD}(x,y) \qquad (5)$$

where 

$$R = \|B_c^{(i,j)}\|_1 = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} B_c^{(i,j)}(m,n), \qquad M(x,y) = \|B_p^{(i-x,j-y)}\|_1 = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} B_p^{(i-x,j-y)}(m,n),$$

$$\mathrm{SAD}(x,y) = \|B_c^{(i,j)} - B_p^{(i-x,j-y)}\|_1 = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} \left| B_c^{(i,j)}(m,n) - B_p^{(i-x,j-y)}(m,n) \right|.$$

$R$ and $M(x,y)$ are sum norms and are pre-computed using the efficient procedure 
described in [1]. 
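
As a rough illustration of inequality (5), the following Python sketch computes the SAD of two blocks and the sum-norm lower bound |R - M(x,y)|. The helper names `sad` and `sum_norm_bound`, the use of NumPy, and the random test blocks are illustrative assumptions; the efficient pre-computation of the sum norms described in [1] is not reproduced here.

```python
import numpy as np


def sad(block_c, block_p):
    """Sum of absolute differences between two equally sized blocks."""
    diff = block_c.astype(np.int64) - block_p.astype(np.int64)
    return int(np.abs(diff).sum())


def sum_norm_bound(block_c, block_p):
    """|R - M(x,y)|: difference of the two sum norms, a lower bound on the SAD."""
    r = int(block_c.sum(dtype=np.int64))
    m = int(block_p.sum(dtype=np.int64))
    return abs(r - m)


# Two hypothetical 16x16 luminance blocks with values in 0..255.
rng = np.random.default_rng(0)
target = rng.integers(0, 256, size=(16, 16))
candidate = rng.integers(0, 256, size=(16, 16))

bound = sum_norm_bound(target, candidate)
full = sad(target, candidate)
print(bound, full, bound <= full)
```

If the bound already exceeds the current minimum SAD, the candidate can be discarded without evaluating the 256 absolute differences.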

In MSEA, each block is partitioned into several sub-blocks. First, the block is 
partitioned into four sub-blocks of size $N/2 \times N/2$. Then each sub-block is partitioned 
into four sub-blocks of size $N/4 \times N/4$. This procedure can be repeated until the size of 
the sub-blocks becomes $2 \times 2$. The maximum level of such partition is $L_{max} = \log_2 N - 1$ for 
blocks of size $N \times N$. The MSEA with $L$-level partition, $0 \le L \le L_{max}$, is called $L$-level 
MSEA, and the SEA [2] corresponds to the 0-level MSEA. 

At the $l$th level of the $L$-level partition, where $0 \le l \le L$, the number of sub-blocks is 
$S_l = 2^{2l}$, and each sub-block is of size $N_l \times N_l$, where $N_l = N/2^l$. If we denote 

$$\mathrm{SAD\_SB}_l = \sum_{u=0}^{2^l-1} \sum_{v=0}^{2^l-1} \left| R_l^{(u,v)} - M_l^{(u,v)}(x,y) \right| \qquad (6)$$

where 

$$R_l^{(u,v)} = \sum_{m=0}^{N_l-1} \sum_{n=0}^{N_l-1} B_c^{(i,j)}(m+uN_l,\, n+vN_l), \qquad M_l^{(u,v)}(x,y) = \sum_{m=0}^{N_l-1} \sum_{n=0}^{N_l-1} B_p^{(i-x,j-y)}(m+uN_l,\, n+vN_l),$$

then [1] shows that equation (5) can be expressed as equation (7) for $L$-level MSEA: 

$$\mathrm{SAD\_SB}_0 = |R - M(x,y)| \le \cdots \le \mathrm{SAD\_SB}_{l-1} \le \mathrm{SAD\_SB}_l \le \cdots \le \mathrm{SAD}(x,y) \qquad (7)$$

where $0 < l \le L$. 

The L-level MSEA procedure is as follows: 

1 select an initial candidate motion vector within the search window in the previous frame 

2 calculate the SAD at the selected point; current minimum SAD (curr_min_SAD) = SAD 

3 select another candidate motion vector among the rest of the search points 

4.0 calculate SAD_SB_0 at the selected search point 
if (curr_min_SAD < SAD_SB_0) goto 7 

4.1 calculate SAD_SB_1 at the selected search point 
if (curr_min_SAD < SAD_SB_1) goto 7 

... 

4.L calculate SAD_SB_L at the selected search point 
if (curr_min_SAD < SAD_SB_L) goto 7 

5 calculate the SAD at the selected search point 
if (curr_min_SAD < SAD) goto 7 

6 curr_min_SAD = SAD 

7 if (not all the search points in the search window have been tested) goto 3 

8 minimum SAD=curr_min_SAD, calculate motion vector 
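
The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the raster scan order (the experiments in this paper use a spiral order), the default search range `m=7`, and the synthetic test frames are all hypothetical, and the sub-block sum norms are recomputed for every candidate instead of being pre-computed as in [1].

```python
import numpy as np


def sad(a, b):
    """Sum of absolute differences between two equally sized integer blocks."""
    return int(np.abs(a - b).sum())


def subblock_sums(block, level):
    """Sum norms of the 2^level x 2^level grid of sub-blocks of a square block."""
    k = 2 ** level
    s = block.shape[0] // k
    return block.reshape(k, s, k, s).sum(axis=(1, 3))


def sad_sb(cur, ref, level):
    """SAD_SB at the given level (equation (6)) for one candidate block."""
    return int(np.abs(subblock_sums(cur, level) - subblock_sums(ref, level)).sum())


def msea_search(cur_frame, prev_frame, i, j, n=16, m=7, levels=3):
    """Return ((x, y), min_sad) for the block whose top-left corner is (i, j)."""
    cur = cur_frame[i:i + n, j:j + n].astype(np.int64)
    height, width = prev_frame.shape
    best, best_mv = None, None
    for x in range(-m, m + 1):
        for y in range(-m, m + 1):
            r, c = i - x, j - y
            if r < 0 or c < 0 or r + n > height or c + n > width:
                continue  # candidate block falls outside the previous frame
            ref = prev_frame[r:r + n, c:c + n].astype(np.int64)
            # Steps 4.0 .. 4.L: any() stops at the first level that eliminates.
            if best is not None and any(
                best < sad_sb(cur, ref, level) for level in range(levels + 1)
            ):
                continue
            d = sad(cur, ref)  # step 5
            if best is None or d <= best:
                best, best_mv = d, (x, y)  # step 6
    return best_mv, best


# Synthetic test: the current frame is the previous frame shifted cyclically.
rng = np.random.default_rng(1)
prev = rng.integers(0, 256, size=(48, 48))
cur = np.roll(prev, shift=(-2, 3), axis=(0, 1))
print(msea_search(cur, prev, 16, 16))
```

Because every test in step 4 is a lower bound on the SAD, the returned minimum equals that of an exhaustive search over the same candidates.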

The MSEA speeds up the process of finding the motion vector by hierarchically 
eliminating impossible candidate motion vectors in the search window before their 
SAD calculation. 



3 Proposed Advanced Multilevel Successive Elimination Algorithms 

3.1 Further Computational Reduction 

Partial distortion elimination (PDE) [8] is an effective speedup technique used in 
vector quantization to find the best reconstruction vector from a set of vector codewords. 
The PDE technique in SAD calculation is the strategy employed in the full search performed 
in the Telenor implementation of Test Model 1.4a distributed as part of the H.263 
standardization effort. 

In MSEA, the PDE technique was used in the SAD calculation but not in the 
SAD_SB_l calculation. PDE can therefore be used with the L-level MSEA to 
reduce the computations further wherever SAD_SB_l must be computed. In equation (6), if 
at any sub-block the partially evaluated sum exceeds the current minimum SAD, the 
point (x,y) cannot be the optimum motion vector and the remainder of the sum does not 
need to be calculated. We denote the PDE technique used in the SAD_SB_l calculation as 
PDEsb to differentiate it from the PDE technique used in the SAD calculation (denoted 
PDEsad). While it is not efficient to test the partial sum against the current minimum 
SAD every time an additional term is added, a reasonable compromise is to perform the 
test after each sub-block row, as shown in table 1. In 3-level MSEA, a block consists of 8 
sub-block rows and each sub-block row consists of 8 sub-blocks. So, the PDE test is 
executed 8 times in the SAD_SB_3 calculation. 
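
A minimal Python sketch of the row-wise PDEsb test is given below. It assumes that the sub-block sum norms of the target and candidate blocks are already available as square arrays; the function name `sad_sb_with_pde` and the parameter `rows_per_test` (corresponding to the "x row" settings of table 1) are hypothetical names introduced for illustration.

```python
import numpy as np


def sad_sb_with_pde(cur_sums, ref_sums, curr_min_sad, rows_per_test=1):
    """Accumulate SAD_SB one sub-block row at a time.

    Returns (accumulated_sum, eliminated). Accumulation stops as soon as a
    scheduled test finds that the partial sum exceeds curr_min_sad.
    """
    total = 0
    rows = cur_sums.shape[0]
    for u in range(rows):
        total += int(np.abs(cur_sums[u] - ref_sums[u]).sum())
        if (u + 1) % rows_per_test == 0 and total > curr_min_sad:
            return total, True  # remaining sub-block rows are skipped
    return total, total > curr_min_sad


# Hypothetical level-3 sum norms (8 x 8 sub-blocks) for one candidate.
cur_sums = np.arange(64).reshape(8, 8)
ref_sums = np.zeros((8, 8), dtype=np.int64)
print(sad_sb_with_pde(cur_sums, ref_sums, curr_min_sad=100))
print(sad_sb_with_pde(cur_sums, ref_sums, curr_min_sad=100, rows_per_test=4))
```

Testing less often saves comparisons but accumulates more terms before elimination, which is the trade-off that table 1 measures.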

The standard video sequence “salesman.qcif” was used in the experiments, and we tested 
100 frames of the sequence. The block size was 16x16 pixels (N=16). The size of the search 
window was 31x31 pixels (M=15) and only integer values for the motion vectors were 
considered. All our experiments (3.1, 3.2, 3.3) were executed under these conditions. 

Experimental results are shown in table 1 and table 2. In table 1, “x row” means that 
the PDEsb test is executed at every x-th sub-block row, and the shorthand MSEA_L (for 
example, MSEA_3) denotes L-level MSEA. In the tables, “m.e.” means a matching evaluation 
that requires SAD calculation, and “avg. # of rows” means the number of rows calculated in 
the SAD calculation before partial distortion elimination. Overhead (in rows) is the sum of 
all the computations other than the SAD calculation, such as the sum norm computations 
using the efficient method described in [1], the computations of step 4, and the 
computations of PDEsb. In the tables, “in rows” means that the computations are expressed in 
units of one row of SAD computation. It is important to notice that with the MSEA, the 
efficiency of the procedure depends on the order in which the candidate motion vectors 
are searched, and that the most likely candidates should be tested first. This eliminates 
the maximum number of candidates. In our experiment, we used a spiral search pattern to 
find the motion vector. 

The method MSEA+PDEsb, which combines the MSEA with PDEsb, reduces the 
computations of MSEA by 0.4%, 2.8%, and 6.2% for MSEA_1, MSEA_2, and MSEA_3 
respectively for this fairly large search window. 



Table 1. Computations of the MSEA_3+PDEsb for several PDEsb test points 

Algorithm              Avg. # of    Avg. # of   Overhead    Total 
                       m.e./frame   rows/m.e.   (in rows)   (in rows) 
MSEA_3+PDEsb (1 row)   242.3        14.39       27,479.6    30,966.3 
MSEA_3+PDEsb (2 row)   242.3        14.39       27,719.9    31,206.6 
MSEA_3+PDEsb (4 row)   242.3        14.39       28,277.9    31,764.6 



Table 2. Computations of the FS algorithm and MSEA with PDEsb 

Algorithm       Avg. # of    Avg. # of   Overhead    Total         Computations 
                m.e./frame   rows/m.e.   (in rows)   (in rows)     reduction 
FS              77,439.0     16.00       0.00        1,239,024.0 
FS+PDEsad       77,439.0     4.36        6,135.0     343,596.8 
MSEA_0 (SEA)    24,995.6     5.79        8,653.9     153,378.4 
MSEA_1          8,769.7      7.84        14,851.5    83,605.9 
MSEA_1+PDEsb    8,769.7      7.84        14,530.3    83,284.7      0.4% 
MSEA_2          1,589.1      10.33       23,552.8    39,968.2 
MSEA_2+PDEsb    1,589.1      10.33       22,434.5    38,849.9      2.8% 
MSEA_3          242.3        14.39       29,510.6    32,997.3 
MSEA_3+PDEsb    242.3        14.39       27,479.6    30,966.3      6.2% 



3.2 Adaptive SAD Calculation 

SAD calculation is computationally intensive and must be done 
at many matching evaluation points per frame, as shown in table 2. To find a method 
that reduces the computations of SAD calculation, we investigated the absolute difference 
values $|B_c^{(i,j)}(m,n) - B_p^{(i-x,j-y)}(m,n)|$ for $0 \le m,n \le N-1$. There are 256 absolute 
difference values in each block. These values range from 0 to the maximum pixel intensity 
value. One outstanding feature is that large absolute difference values are clustered 
together. This property is shown in table 3. If we accumulate the SAD from large absolute 
difference values to small ones, partial distortion elimination in the SAD 
calculation can occur very early, so the computations of SAD calculation can be 
reduced. 

To calculate the SAD adaptively for each block, we divided each block into 4 sub-blocks 
of 8 x 8 pixels, sampled 8 points in each sub-block, and calculated the sum of 
absolute differences of the sampled pixels for each sub-block, as shown in Fig. 1. 
The sampling points in each sub-block must be chosen to represent the SAD of that 
sub-block. The sums of sampled absolute difference values of the sub-blocks are sorted to 
determine the order in which the sub-blocks are used in the SAD calculation. We first 
calculate the SAD of the sub-block whose sum of sampled absolute difference values 
is greatest, and then proceed through the remaining sub-blocks in the sorted 
order. 
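
The ordering idea can be sketched in Python as below. Several details are assumptions made only for illustration: the 8 sample points per sub-block are taken along the sub-block diagonal (the actual sampling pattern is the one defined in Fig. 1), the elimination test is performed once per sub-block, and the name `adaptive_sad` is hypothetical.

```python
import numpy as np


def adaptive_sad(cur, ref, curr_min_sad):
    """SAD of two 16x16 blocks, accumulated sub-block by sub-block.

    Sub-blocks (8x8) are visited in decreasing order of their sampled
    absolute-difference sum. Returns (accumulated_sum, eliminated).
    """
    cur = cur.astype(np.int64)
    ref = ref.astype(np.int64)
    corners = [(r, c) for r in (0, 8) for c in (0, 8)]
    idx = np.arange(8)  # assumed sampling pattern: the sub-block diagonal
    sampled = [
        int(np.abs(cur[r + idx, c + idx] - ref[r + idx, c + idx]).sum())
        for r, c in corners
    ]
    order = sorted(range(4), key=lambda k: sampled[k], reverse=True)
    total = 0
    for k in order:
        r, c = corners[k]
        total += int(np.abs(cur[r:r + 8, c:c + 8] - ref[r:r + 8, c:c + 8]).sum())
        if total > curr_min_sad:
            return total, True  # partial distortion elimination
    return total, False


# Hypothetical blocks that differ only in the lower-right sub-block.
cur = np.zeros((16, 16), dtype=np.int64)
ref = np.zeros((16, 16), dtype=np.int64)
ref[8:, 8:] = 10
print(adaptive_sad(cur, ref, curr_min_sad=500))
```

In this example the lower-right sub-block is visited first, so the candidate is rejected after one sub-block instead of four.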




Fig. 1. Block division and sampling 

As the ratio of maximum to minimum sub-block SAD increases, partial distortion 
elimination is achieved earlier in the Adaptive SAD Calculation algorithm. Table 3 shows 
that a considerable portion of the blocks have a very large ratio. The total number of 
blocks in table 3, checked over 100 frames, is 7,734,000. The Adaptive SAD Calculation 
algorithm (denoted SADadap) reduces the computations by 10.6%, 7.2%, 6.2%, 
3.0%, and 0.4% for FS+PDEsad, MSEA_0, MSEA_1, MSEA_2, and MSEA_3 respectively. 

Table 3. The distribution of blocks according to the ratio of 
(maximum SAD of sub-blocks/minimum SAD of sub-blocks) 

Ratio    >10    >8     >6     >4     >2      >1.8    >1.6    >1.4    >1.2   >1.0 
Blocks   2.1%   1.6%   3.2%   9.1%   41.6%   10.0%   11.2%   11.3%   8.0%   1.8% 



Table 4. Computations of Adaptive SAD Calculation algorithm 

Algorithm        Avg. # of    Avg. # of   Overhead    Total       Computations 
                 m.e./frame   rows/m.e.   (in rows)   (in rows)   reduction 
FS+SADadap       77,439.0     3.90        5,095.4     307,107.5   10.6% 
MSEA_0+SADadap   24,995.6     5.27        10,631.2    142,358.0   7.2% 
MSEA_1+SADadap   8,769.7      7.16        15,603.2    78,394.3    6.2% 
MSEA_2+SADadap   1,589.1      9.50        23,673.2    38,769.7    3.0% 
MSEA_3+SADadap   242.3        13.87       29,521.0    32,881.7    0.4% 



The Adaptive SAD Calculation algorithm can be combined not only with MSEA but also with 
all kinds of block matching algorithms to reduce the computations of motion estimation. 



3.3 Elimination Level Estimation 



As shown in step 4 of the MSEA procedure, if a search point is eliminated at level l, L-level 
MSEA checks SAD_SB_0, SAD_SB_1, ..., SAD_SB_l sequentially and then eliminates the 
search point at level l, where 0 ≤ l ≤ L+1. In the Elimination Level Estimation algorithm, 
step 4.(L+1) is equal to step 5 of the MSEA. If we can estimate the elimination level l, 
where 0 ≤ l ≤ L+1, of a search point in L-level MSEA, we can reduce the computations 
because steps 4.0, 4.1, ..., 4.(l-1) are not necessary. 

The estimation result can be classified into five groups as follows: 

case 1, hit1: the estimated elimination level (EEL) is 0 and EEL is equal to the 
practical elimination level (EL) 

case 2, hit2: EEL is greater than 0 and EEL is equal to EL 

case 3, acceptable miss1: EEL is 0 and EEL is smaller than EL 

case 4, acceptable miss2: EEL is greater than 0 and EEL is smaller than EL 

case 5, non-acceptable miss: EEL is greater than EL 

There is a profit when the estimation result is a hit2 or acceptable miss2 case. Hit1 and 
acceptable miss1 cases cannot reduce computations. There is a loss when the estimation 
result is a non-acceptable miss, because it increases computations. The loss is 
approximately 4 times greater than the profit between two levels that differ by one 
level, so the loss from one incorrect estimation cancels the profit from 4 correct 
estimations. The estimation method therefore requires a very high level of correctness. To 
achieve this, we use the conservative estimation function (8). 

In this algorithm, the motion vector search pattern is as shown in fig. 2, which is 
modified from the spiral search pattern. A black circle represents a search point at which 
MSEA is executed, and a white circle represents a search point that requires elimination 
level estimation. The motion vector search sequence follows the numbering in 
fig. 2. At search point 19, search points 2, 10, 11, and 19 are search points whose 
elimination levels have been calculated, and point 20 is the search point that requires 
elimination level estimation. We call a search point that requires elimination level 
estimation, such as point 20, an estimation search point. We denote the practical 
elimination level of search point 10 and the estimated elimination level of search point 20 by 
EL(10) and EEL(20) respectively. We used equation (8) as the estimation function. 

$$\mathrm{EEL}(20) = \min(\mathrm{EL}(10),\ \mathrm{EL}(2),\ \mathrm{EL}(11),\ \mathrm{EL}(19)) \qquad (8)$$

The correctness of this estimation function is excellent, as shown in table 5, but there is 
estimation overhead. To reduce the overhead of the estimation function, the four points are 
divided into two groups, {10, 2} and {11, 19}. Points {10, 2} are lower left side points of 
point 20 and upper right side points of point 37. Points {11, 19} are upper right side 
points of point 20 and lower left side points of point 42. We can express equation (8) as 
follows: 

$$\mathrm{EEL}(20) = \min(\mathrm{lside}(20),\ \mathrm{uside}(20)) \qquad (9)$$

where lside(20) = min(EL(10), EL(2)) and uside(20) = min(EL(11), EL(19)). 

The grouping technique is useful for the following operation. If EL(10) is 
equal to 0, then EEL(20), EEL(40), EEL(65), EEL(37), lside(20), uside(37), 
lside(40), and uside(65) are set to zero. So, if the elimination level of one of the 
neighbor points is zero, estimation is unnecessary. If none of the neighbor points has 
elimination level zero, the estimated elimination level can be 
found easily by equation (9). 
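
The following Python sketch illustrates how an estimated elimination level can be used to skip the first tests of step 4. It is only an illustration: the function names are hypothetical, and the list `bounds` stands in for the values SAD_SB_0, ..., SAD_SB_L, SAD of one search point, which a real implementation would compute lazily, one level at a time.

```python
def estimated_level(lower_side_levels, upper_side_levels):
    """EEL as in equation (9): the minimum over the two neighbor groups."""
    lside = min(lower_side_levels)
    uside = min(upper_side_levels)
    return min(lside, uside)


def elimination_level(bounds, curr_min_sad, start=0):
    """Return the first tested level l >= start with curr_min_sad < bounds[l].

    bounds = [SAD_SB_0, ..., SAD_SB_L, SAD] is nondecreasing by equation (7).
    Returns None when the search point survives every test.
    """
    for level in range(start, len(bounds)):
        if curr_min_sad < bounds[level]:
            return level
    return None


# Hypothetical neighbor levels: EL(10)=2, EL(2)=1, EL(11)=3, EL(19)=2.
eel = estimated_level((2, 1), (3, 2))
bounds = [10, 60, 90, 120, 150]  # hypothetical values for search point 20
print(eel, elimination_level(bounds, 50), elimination_level(bounds, 50, start=eel))
```

Starting at the estimated level gives the same elimination decision whenever EEL does not exceed EL; if EEL exceeds EL (a non-acceptable miss), the point is still eliminated, but only after computing a more expensive level.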




Fig. 2. Motion vector search pattern in search window for Elimination Level Estimation 
algorithm 

As shown in table 5, the hit (hit1+hit2) ratio of the estimation is 89%, the 
(hit1+hit2+acceptable miss1+acceptable miss2) ratio is nearly 100%, and the non-acceptable 
miss ratio is nearly 0%. So, the estimation function used in this algorithm 
is reliable. As shown in table 6, the Elimination Level Estimation algorithm reduces the 
computations of MSEA by 0.3%, 1.3%, 1.9%, and 2.1% for MSEA_0, MSEA_1, MSEA_2, and 
MSEA_3 respectively. Although the estimation is reliable, the reduction of 
computations is small because most blocks are eliminated at level 0, there is estimation 
overhead, and the loss incurred by an incorrect estimation is larger than the profit incurred 
by a correct estimation. Table 7 shows the number of blocks that are eliminated at each 
level when 3-level MSEA is used. The total number of blocks for 100 frames is 7,734,000. If 
this algorithm is applied to a video sequence in which the elimination rate at level 0 is 
small and the elimination rate at upper levels is large, its performance will 
increase. 

Table 5. Elimination level estimation results 

Algorithm               Case 1      Case 2    Case 3    Case 4    Case 5 
MSEA_0+EL estimation    2,245,282   807,943   296,204   0         171 
MSEA_1+EL estimation    2,245,282   713,052   296,204   94,855    207 
MSEA_2+EL estimation    2,245,282   681,906   296,204   125,982   226 
MSEA_3+EL estimation    2,245,282   681,537   296,204   126,351   226 


Table 6. Computations of MSEA combined with Elimination Level Estimation algorithm 

Algorithm               Avg. # of    Avg. # of   Overhead    Total       Computations 
                        m.e./frame   rows/m.e.   (in rows)   (in rows)   reduction 
MSEA_0+EL estimation    24,997.3     5.79        8,148.9     152,883.3   0.3% 
MSEA_1+EL estimation    8,770.1      7.84        13,741.5    82,499.1    1.3% 
MSEA_2+EL estimation    1,589.2      10.33       22,791.3    39,207.7    1.9% 
MSEA_3+EL estimation    242.4        14.39       28,815.3    32,303.4    2.1% 



Table 7. The number of elimination blocks at each level for MSEA_3 

                    Level 0     Level 1     Level 2    Level 3    at m.e. 
Elimination block   5,244,345   1,622,582   718,063    134,679    14,331 
                    (67.8%)     (21.0%)     (9.3%)     (1.7%)     (0.2%) 

at m.e.: eliminated at matching evaluation 



4 Conclusions 

Advanced Multilevel Successive Elimination Algorithms, which are improved versions 
of the Multilevel Successive Elimination Algorithm, have been proposed to reduce the 
computation cost of block matching algorithms for motion estimation in video coding. 

The Further Computational Reduction and Elimination Level Estimation algorithms are 
useful only in MSEA, but the Adaptive SAD Calculation algorithm can be combined not 
only with the MSEA but also with all kinds of block matching algorithms to reduce 
computations. AMSEA provides a reduction in computations over the MSEA while keeping the 
same motion estimation accuracy as the FS. 

AMSEA are a very efficient solution for video coding applications that require both 
very low bit-rate and good coding quality. 



News Content Highlight via Fast Caption Text 
Detection on Compressed Video 

Abstract. Captions present in video frames play an important role in 
understanding video content. This paper presents a fast algorithm to 
automatically detect captions in MPEG compressed video. It is based on 
statistical features of the caption text's chrominance components. The paper also 
discusses its principle and speed-up mechanism in detail. We have successfully 
exploited the technique to automatically construct the pictorial catalogue, a new 
content representation. Experimental results show that the proposed caption detection 
algorithm achieves an accuracy of 96.6% and a recall of 100%, as well as a 
detection speed faster than real time. 



1 Introduction 

In the field of content-based visual information retrieval (CBVIR), automatic 
detection of high level visual features is a significant topic. Text present in video 
frames plays an important role in understanding video content. In many classes of 
video, such as news and documentaries, text is often overlapped on the original natural 
video to form a concise annotation of the relevant video clips. Gargi et al. [1] call it 
caption text, to distinguish it from scene text, which occurs naturally in the 3-D scene. 

Some research efforts have been made to detect text in video. [2] proposed an 
algorithm to detect text regions in video frames. They first exploited the feature of 
clustered sharp edges in a typical text region to determine candidates. Then 
consistent detection of the same region over a certain period of time was applied to 
verify them. The algorithm requires no a priori knowledge and is based entirely on 
analysis of frame images, so it can be applied to all classes of video. But since it 
operates on original images, intensive decoding computation results in a relatively low 
detection speed. [3] proposed a caption detection algorithm in the MPEG compressed 
domain. Based on the assumption that caption appearance and disappearance often 
occur in the middle of a video shot, a technique similar to [4] was applied to compute 
the inter-frame content difference in caption regions. At the same time, the algorithm 
identified and ignored the large content difference caused by shot transitions, to locate 
caption appearance and disappearance events. But our observation of CCTV news 
shows that caption text commonly spans multiple shots, so the algorithm in 
[3] is not applicable in this context. In this paper, we propose a fast algorithm 
that automatically detects caption text in MPEG compressed video. Our approach is 
based on statistical features of the caption text's chrominance components. 

The rest of the paper is organized as follows. Section 2 first gives a more general 
model of the video overlap operation and discusses the principle of the algorithm. 
Then its details are described. Finally, we address the speed-up mechanism. In 
section 3, we apply the algorithm to automatically construct a pictorial catalogue and 
give experimental results. Section 4 concludes the paper. 



2 Caption Text Detection in Compressed MPEG Video 



2.1 Overlap Model of Video Objects and Principle of the Proposed Algorithm 



Many multimedia applications involve various manipulations and compositing of 
video signals, including translation, overlap, etc. During the editing stage of production 
or live broadcast, some annotating text is often overlapped on frames in a clip. 
Let $N$ and $S$ represent a background object and a foreground object before overlapping, 
respectively. Many types of overlapping can exist between them. Opaque overlapping 
requires substituting the pixels of the object $S$ for those of the object $N$, while 
semi-transparent overlapping requires a linear combination of them, based on a 
transparency factor $\alpha$ ($0 \le \alpha \le 1$), i.e., 

$$P[i,j] = \alpha N[i,j] + (1-\alpha) S[i,j] \qquad (1)$$

where $P[i,j]$, $N[i,j]$, and $S[i,j]$ are the pixels of the new object, the background 
object, and the foreground object at position $(i,j)$. When $\alpha = 0$, we get the result of 
opaque overlapping. 

A more complicated mixture of the two aforementioned overlap modes exists, i.e., 
some pixels of the object $P$ are produced through opaque overlapping (let these 
pixels form a set $\Lambda$), while the remaining pixels are produced through semi-transparent 
overlapping (the corresponding pixel set is $\Pi$, with $\Pi \cap \Lambda = \emptyset$). We can 
establish a model for this overlapping as formula (2). 

$$P[i,j] = \alpha \cdot M[i,j] \cdot N[i,j] + (1-\alpha) \cdot M[i,j] \cdot S[i,j] + W[i,j] \cdot S[i,j] \qquad (2)$$

where the functions $M[i,j]$ and $W[i,j]$ are defined as follows: 

$$M[i,j] = \begin{cases} 1 & (i,j) \in \Pi \\ 0 & (i,j) \in \Lambda \end{cases}, \qquad W[i,j] = \begin{cases} 0 & (i,j) \in \Pi \\ 1 & (i,j) \in \Lambda \end{cases} \qquad (3)$$



This kind of overlapping is common in CCTV news programs. The text region in a 
caption object $S$ forms the set $\Lambda$, while the text background region in the object $S$ 
corresponds to the set $\Pi$. Here we can consider $S$ as a binary image, i.e., 

$$S[i,j] = \begin{cases} a & (i,j) \in \Lambda \\ b & (i,j) \in \Pi \end{cases} \qquad (4)$$

where $a$ and $b$ are the text and text background colors in the caption object $S$. 
From (4), formula (2) can be converted to (5): 

$$P[i,j] = \begin{cases} \alpha N[i,j] + (1-\alpha) \cdot b & (i,j) \in \Pi \\ a & (i,j) \in \Lambda \end{cases} \qquad (5)$$

In the Y-Cb-Cr color space, a component of each pixel is an integer from 0 to 255. 
So, for a pixel in the region $\Pi$, its corresponding component value must range from 
$(1-\alpha) b$ to $(1-\alpha) b + 255\alpha$ after overlapping, while for a pixel in the set $\Lambda$ the 
result is the constant $a$. Since each pixel in the caption text region $\Pi \cup \Lambda$ has this 
property, we can expect that if we calculate the average $v$ of all the pixels in the region 
$\Pi \cup \Lambda$, then $v$ will also belong to a specific range. Usually when a person selects the 
colors $a$ and $b$ for the object $S$, they form a strong contrast against those of 
common natural background objects. 

The above analysis results have been verified by our observation of CCTV news. 
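
To make the value range concrete, the short Python sketch below evaluates the semi-transparent blend of formula (1) at the two extreme background values. The function names and the example values (a transparency factor of 0.25 and a caption background component of 200) are assumptions chosen only for illustration.

```python
def blend(alpha, background, foreground):
    """Semi-transparent overlap of one pixel component, as in formula (1)."""
    return alpha * background + (1 - alpha) * foreground


def semi_transparent_range(alpha, b, max_value=255):
    """Range of a component in the caption background region after overlap."""
    low = (1 - alpha) * b
    return low, low + alpha * max_value


# Hypothetical caption background component b = 200 with alpha = 0.25.
print(semi_transparent_range(0.25, 200))
print(blend(0.25, 0, 200), blend(0.25, 255, 200))
```

The smaller the transparency factor, the narrower this range, which is what makes the average of the region a usable detection feature.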



2.2 Detection of Caption Text 



Applying the algorithm in [4], we can extract a DC image for each frame in MPEG 
video with minimal decoding. The DC value of each block is equal to the average of the 
pixels in that block. By interactively choosing caption text regions as input 
samples, we can obtain the following important information: (i) the caption text 
region $T$; (ii) the dynamic distribution ranges of DC values in $T$; (iii) the dynamic 
distribution ranges of the average and standard deviation of the DC values in $T$. 

Since caption text in video generally persists for no less than 3 seconds, 
the algorithm examines only the I frames, which speeds up the detection process. 
The details of the proposed algorithm are described in the following. 

① Initialization. Open the video document fp, and obtain the sequence number 
CurFrmNum of the first I frame and the GOP length gl of the MPEG stream. 

② Extract the DC images $X_{cb} = \{x_{ij}^{cb}\}$ and $X_{cr} = \{x_{ij}^{cr}\}$ for the chrominance 
components Cb and Cr from the frame with the sequence number CurFrmNum. 

③ Suppose $T$ represents the caption text region and $b_{ij}$ is the region of the element 
whose index is $(i,j)$ in the DC images $X_{cb}$ and $X_{cr}$. Let 
$C^{cb} = \{c^{cb} \mid c^{cb} = x_{ij}^{cb},\ b_{ij} \subset T\}$ and $C^{cr} = \{c^{cr} \mid c^{cr} = x_{ij}^{cr},\ b_{ij} \subset T\}$. 
Calculate $F = (r_l^{cb}, r_h^{cb}, avg^{cb}, sd^{cb}, r_l^{cr}, r_h^{cr}, avg^{cr}, sd^{cr})$, where 

$$r_l^{cb} = \min_{c \in C^{cb}} c, \quad r_h^{cb} = \max_{c \in C^{cb}} c, \quad avg^{cb} = \frac{1}{|C^{cb}|} \sum_{c \in C^{cb}} c, \quad sd^{cb} = \sqrt{\frac{1}{|C^{cb}|} \sum_{c \in C^{cb}} (c - avg^{cb})^2},$$

the Cr features are defined in the same way over $C^{cr}$, and $|C|$ represents 
the cardinal number of the set $C$. 

④ If $avg^{cb} \in [avg_l^{cb}, avg_h^{cb}]$, $sd^{cb} \in [sd_l^{cb}, sd_h^{cb}]$, $[r_l^{cb}, r_h^{cb}] \subset [v_l^{cb}, v_h^{cb}]$, 
$avg^{cr} \in [avg_l^{cr}, avg_h^{cr}]$, $sd^{cr} \in [sd_l^{cr}, sd_h^{cr}]$, and $[r_l^{cr}, r_h^{cr}] \subset [v_l^{cr}, v_h^{cr}]$, 
then caption text exists in the frame 
with sequence number CurFrmNum, where $[v_l^{cb}, v_h^{cb}]$, $[v_l^{cr}, v_h^{cr}]$, $[avg_l^{cb}, avg_h^{cb}]$, 
$[avg_l^{cr}, avg_h^{cr}]$, $[sd_l^{cb}, sd_h^{cb}]$, and $[sd_l^{cr}, sd_h^{cr}]$ are the dynamic ranges of the different 
components in the feature vector $F$, which are obtained through the interactive off-line 
statistical process aforementioned. Besides the features in $F$, relation features 
between the two chrominance components' statistical data can also be found and utilized 
to improve detection accuracy. For example, we observed that in CCTV news there 
exist relation features such as $avg^{cb} \le avg^{cr}$ and $sd^{cb} \ge sd^{cr}$. 

⑤ To eliminate noise, the system declares a caption appearance event if and only 
if it consistently detects captions in Ws consecutive I frames, and represents the start 
frame of the caption event by STextFrmNum. Similarly, only when the system does not 
find captions in We consecutive I frames is the disappearance event 
declared, and the detection result (STextFrmNum, ETextFrmNum) is output, where 
ETextFrmNum represents the end frame of the caption event. Ws and We are system 
parameters; we choose Ws = We = 3 in our system. 

⑥ CurFrmNum = CurFrmNum + gl. If the end of the document has not been reached, then 
go to ②. Otherwise the file fp is closed, and the whole algorithm exits. 
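
The per-frame test and the temporal filtering of the steps above can be sketched in Python as follows. This is a simplified illustration under several assumptions: the DC values of the caption region are given as plain lists, the `ranges` dictionary layout and all function names are hypothetical, the relation features between Cb and Cr are omitted, the end frame of an event is taken to be the last I frame in which the caption was detected, and a caption still visible at the end of the stream is not reported.

```python
import numpy as np


def chroma_features(dc_values):
    """(min, max, average, standard deviation) of the DC values in region T."""
    c = np.asarray(dc_values, dtype=float)
    return c.min(), c.max(), c.mean(), c.std()


def frame_has_caption(cb, cr, ranges):
    """Range test on the feature vector F for one I frame."""
    for name, values in (("cb", cb), ("cr", cr)):
        lo, hi, avg, sd = chroma_features(values)
        v_lo, v_hi = ranges[name]["v"]
        a_lo, a_hi = ranges[name]["avg"]
        s_lo, s_hi = ranges[name]["sd"]
        if not (v_lo <= lo and hi <= v_hi and a_lo <= avg <= a_hi and s_lo <= sd <= s_hi):
            return False
    return True


def caption_events(flags, frame_numbers, ws=3, we=3):
    """Turn per-I-frame decisions into (start_frame, end_frame) caption events."""
    events, active, run = [], False, 0
    start = last_seen = candidate = None
    for num, has_caption in zip(frame_numbers, flags):
        if not active:
            if has_caption:
                if run == 0:
                    candidate = num  # first frame of the current positive run
                run += 1
                if run == ws:
                    active, start, last_seen, run = True, candidate, num, 0
            else:
                run = 0
        elif has_caption:
            last_seen, run = num, 0
        else:
            run += 1
            if run == we:
                events.append((start, last_seen))
                active, run = False, 0
    return events


flags = [0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0]       # hypothetical decisions
frames = [12 * k for k in range(len(flags))]       # assumed GOP length of 12
print(caption_events(flags, frames))
```

The single missed detection inside the caption and the isolated detection near the end are both filtered out, which is the purpose of the Ws and We parameters.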



2.3 Speeding up the Detection Process 

In the aforementioned algorithm, frames are sampled at the granularity of a 
GOP to detect the appearance and disappearance of captions. If we apply a larger 
granularity to sample frames, the number of frames examined will be greatly reduced, and 
thus the detection speed will increase. Since a caption is used to annotate the content of a 
relevant clip, it should stay long enough for viewers to notice it and be impressed, 
especially in news, so we can use the granularity $\varepsilon \cdot gl$ to detect caption events, 
where $\varepsilon$ is an integer greater than 1, called the granularity factor. 

If the minimal resident time of a caption in a video stream is $\tau$, it is easy to prove that 
if the frame sample granularity $l < \tau$ is used, then at least one frame in each clip 
containing a caption will be checked. So the algorithm will not miss any 
caption because of too large a frame sample granularity. Similarly, if the minimal time 
interval between two consecutive clips with different captions is $\mu$, it holds that if 
the frame sample granularity $l < \mu$, then at least one frame in each clip with no 
caption will be checked. So the algorithm will not mistake different captions for the 
same caption because of too large a frame sample granularity. Thus when the current 
frame does not contain a caption and the system is ready to detect the next caption 
appearance event, the granularity factor $\varepsilon_1$ chosen should satisfy $\varepsilon_1 \cdot gl < \tau$, 
i.e., $\varepsilon_1 < \tau / gl$. Similarly, when the current frame contains a caption and the caption's 
disappearance event is to be detected, the granularity factor $\varepsilon_2$ chosen should satisfy 
$\varepsilon_2 < \mu / gl$. From the above, we know that when the system chooses a sample 
granularity $l$ to detect caption text, it assumes that the $\tau$ and $\mu$ of the MPEG stream satisfy 
$l < \tau$ and $l < \mu$. In practical applications, the assumption is easy to satisfy, since 
some a priori knowledge can help us choose a suitable $l$, or a friendly human-machine 
interface can be designed to set the parameter $l$ flexibly. 
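
The constraint on the granularity factor can be turned into a small helper, sketched here in Python. The function name is hypothetical and all quantities are assumed to be measured in frames; the example uses a minimal caption duration of 3 seconds at 24 f/s (72 frames) and an assumed GOP length of 12 frames.

```python
def max_granularity_factor(min_duration_frames, gop_length):
    """Largest integer factor eps with eps * gop_length < min_duration_frames.

    Returns at least 1, because frames cannot be sampled more finely than
    one I frame per GOP in this scheme.
    """
    eps = (min_duration_frames - 1) // gop_length
    return max(eps, 1)


# 3 seconds at 24 f/s is 72 frames; the GOP length of 12 is an assumption.
print(max_granularity_factor(72, 12))
```

The same helper applies to the disappearance case by passing the minimal interval between captions instead of the minimal caption duration.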



3 Experiments 

Owing to the constraint of Internet bandwidth, people hope computers can extract 
minimal video data to summarize the essential content of the corresponding video 
document, to alleviate network burden and reduce the cost of network traffic. Many 
abstract representations of video content have been proposed, such as the storyboard, 
scene transition graph, pictorial summary, etc. This paper suggests a new form for 
presenting the essence of news content, called the pictorial catalogue. It is a group of frame 
images arranged in time order, where each image contains a caption that annotates a news 
item. Its construction is based on the detection of caption text in news video. 

Our observation of CCTV news shows that almost every news item clip contains a 
caption overlapped on some consecutive frames to annotate the clip. Usually, CCTV 
News lasts half an hour and includes about 20 news items. So using a pictorial catalogue 
is attractive, since it can present a user with only about 20 images that convey 
the essential news content of that day. 



Table 1. The Experiment Results of Detecting Captions over the Test Set 

Sequences   Number      Total          Output of       False   Missed 
            of frames   captions (S)   detection (D)   (E) 
News0       44965       19             20              1       0 
News1       44770       22             22              0       0 
News2       45106       21             23              2       0 
News3       44913       19             19              0       0 
News4       36339       17             19              2       0 
News5       59610       18             18              0       0 
News6       44447       16             16              0       0 
News7       49775       12             12              0       0 
Sum total   369925      144            149             5       0 



We use 8 days of CCTV news MPEG streams as a test set to verify the validity of
the proposed algorithm. All experiments were run on a PC with a PIII-450 CPU and
64 MB of memory. The MPEG streams have a frame rate of 24 f/s and a frame
dimension of 720 x 576 pixels. The whole test set contains 369925 frames and lasts
about 4.5 hours. The experimental results are tabulated in Table 1. Based on the
statistics in Table 1, the accuracy of detecting captions is

$$\text{accuracy} = 1 - \frac{E}{D} = 1 - \frac{5}{149} = 96.6\%$$

and the recall is

$$R = 1 - \frac{M}{S} = 1 - \frac{0}{144} = 100\%,$$

where S is the total number of captions, D the number of detections output by the
algorithm, E the number of false detections and M the number of missed captions.
A recall of 100% together with high accuracy is the ideal outcome, because it is easy
for a user to discard the few images that do not contain a caption from a small number
of images through an interaction tool. Figure 1 shows a small part of a pictorial
catalogue generated by the algorithm for News3. Table 2 lists the system's runtime
on some of the streams under different granularities. Although the decoding engine
embedded in the system achieves only 8 f/s on MPEG-2 streams, Table 2 implies that
our algorithm runs faster than real time. This is mainly because the whole detection
process operates entirely in the compressed domain and several speed-up mechanisms
are introduced in the system.
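
The accuracy and recall figures quoted above can be re-derived from the Table 1
totals. The following is only a small sketch: the function name and argument names
are illustrative, and it assumes that the last column of Table 1 counts missed
captions.

```python
def detection_quality(total_captions, detections, false_detections, missed):
    """Return (accuracy, recall) for a caption detector.

    accuracy = 1 - false_detections / detections
    recall   = 1 - missed / total_captions
    """
    accuracy = 1.0 - false_detections / detections
    recall = 1.0 - missed / total_captions
    return accuracy, recall


# "Sum total" row of Table 1: S = 144, D = 149, 5 false detections, 0 missed.
accuracy, recall = detection_quality(
    total_captions=144, detections=149, false_detections=5, missed=0
)
print(f"accuracy = {accuracy:.1%}, recall = {recall:.1%}")
```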




Fig. 1. An Example of Pictorial Catalogues (News3) 



Table 2. Detection Time of Partial Streams under the Different Granularities

Sequences    Detection time for four values of the granularity factor g
News5        615     344     240     191
News6        370     206     148     109
News7        483     286     171     134

Here g represents the granularity factor.



4 Conclusions 

This paper proposes a fast caption detection algorithm that works in the compressed
domain. We successfully applied the technique to constructing pictorial catalogues of
CCTV news video. Our experimental results show that the proposed algorithm has not
only high accuracy and ideal recall, but also a very fast detection speed, so it has
significant application value. Its disadvantage is that the color distribution model
of the captions must be established through an off-line interaction tool. It is
therefore best suited to news video, where many properties of caption text are
relatively stable.
Abstract. This paper proposes texture-based text location methods using a
neural network (NN) and a Support Vector Machine (SVM). Both an NN and an
SVM are employed to train a set of texture discrimination masks for the given
texture classes: text region and non-text region. Unlike most traditional text
location schemes, these two approaches use no feature extraction stage, and
discrimination filters for several environments can be automatically
constructed. Comparisons between the NN/SVM-based text location
methods and a connected component method are presented.



1 Introduction 

Text location has been studied for many applications including document analysis,
text-based video indexing, multimedia systems, digital libraries, and vehicle number
plate recognition [2-5, 7, 8]. Recently, researchers have proposed various methods for
the text-based retrieval of video documents. Automatic text location for video
documents is very important as a preprocessing stage for optical character
recognition. However, text region location in complex images may suffer from the low
resolution of the characters, unlike the location of black characters on white
documents [2].

There are two primary methods for text location. The first method uses a connected
component analysis [5, 8]. This is not appropriate for video documents because it
depends on the effectiveness of a segmentation that guarantees that each character is
segmented as one connected component separated from other objects. On the other
hand, texture-based methods are based on the observation that text in video images
has distinct textural properties, and they use Gabor filters, wavelet decomposition,
and spatial variance [4, 6, 7]. In this approach, text location can be posed as a
texture classification problem where problem-specific knowledge is available prior to
the classification. The use of texture information for text location is sensitive to
character font size and style. Therefore, in a complex situation with various font
styles, sizes, and colors, it is difficult to generate a texture filter set manually.

This paper presents texture-based text location methods using an NN and an SVM,
which are employed to train a set of texture discrimination masks for the given texture
classes: text region and non-text region. An NN and an SVM have several similar
aspects. Unlike most traditional text location schemes, these two approaches use no
feature extraction stage, and discrimination filters for several applications
can be automatically constructed. Comparisons between the NN/SVM-based texture
discriminations and a connected component method are presented. As shown in the
experimental results, the NN and the SVM exhibit a trade-off between time and accuracy.

The remainder of the paper is organized as follows. Chapter 2 describes the 
proposed NN-based method and SVM-based method. Experimental results are shown 
in Chapter 3. Chapter 4 presents some final conclusions and outlines future work. 



2 Texture-based Text Location Methods 

This chapter presents the NN/SVM-based text location methods, whose performance
is later compared with that of a connected component method. The proposed text
location system operates in two stages. First, it applies an NN or an SVM to classify
the pixels in the input image; that is, the feature extraction and pattern recognition
stages are integrated in the classifier. The classified image is a binary image in
which the pixels classified as text are black and those classified as non-text are
white. Then it post-processes the filtered output by eliminating noise and placing
bounding boxes.
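
The first stage described above, classifying every pixel from its local window, can
be sketched as follows. This is only an illustration under stated assumptions: the
image is a plain list of rows of gray levels, `classify_window` is a hypothetical
callable standing in for the trained NN or SVM, and border pixels whose window would
fall outside the image are simply left as non-text.

```python
def classify_image(image, classify_window, m=3):
    """Slide an m x m window over a grayscale image (a list of rows).

    Returns a binary map with 1 where the window's centre pixel is labelled
    as text by `classify_window`, and 0 elsewhere (including the border).
    """
    half = m // 2
    height, width = len(image), len(image[0])
    result = [[0] * width for _ in range(height)]
    for r in range(half, height - half):
        for c in range(half, width - half):
            # Raw intensities of the window, row by row; no feature extraction.
            window = [image[rr][cc]
                      for rr in range(r - half, r + half + 1)
                      for cc in range(c - half, c + half + 1)]
            result[r][c] = 1 if classify_window(window) else 0
    return result


def toy_classifier(window):
    # Hypothetical stand-in for a trained classifier: high local contrast
    # is treated as "text". A real system would call the NN or SVM here.
    return max(window) - min(window) > 100


image = [
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 250, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
]
binary = classify_image(image, toy_classifier, m=3)
```

The noise elimination and bounding-box placement of the second stage would then
operate on `binary`.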



2.1 NN-based Text Locations 

This section presents an NN-based text location method. A neural network is used to
classify the pixels of input images. To this end, the NN-filter is applied at each 
location in the image. That is, the neural network examines local regions looking for 
text pixels that may be contained in a text region. 



[Figure: a neural network for intensity values, with an input layer, hidden layer 1,
hidden layer 2, and an output layer]

Fig. 1. A three-layer feed-forward neural network [6]



An input image is segmented into text and non-text classes using a multi-layer
feed-forward neural network classifier, which receives the intensity values of a given
pixel and its neighbors as input. The activation values of the output node are used to
determine the class of the given central pixel. The classified image is obtained as a
binary image in which the text pixels are black [4]. Fig. 1 describes the architecture
of the neural network-based classifier. Adjacent layers are fully connected, the nodes
of the hidden layers operate as feature extraction masks, and the output layer is used
to determine the class of a pixel: text or non-text. The input layer receives the
intensity values of the pixels at predefined positions inside an MxM window over an
input frame [6]. To investigate the properties of the masks, their frequency responses
were examined using the image shown in Fig. 2-(a) [6]. Fig. 2-(b)-(e) show the outputs
of the first hidden layer's nodes when applied to Fig. 2-(a). For better visualization,
the output images were smoothed and their contrast was enhanced. These images
show that each hidden node has its own orientation and localized frequency.




(a) (b) (c) (d) (e) 

Fig. 2. Frequency responses (b)-(e) of hidden nodes when applied to 128x128 image in (a) 



2.2 SVM-based Text Locations 



A Support Vector Machine has been recently proposed as a method for pattern 
classification and nonlinear regression [1]. For several pattern classification 
applications SVMs have been shown to provide a better generalization performance 
than traditional techniques such as neural networks. This section gives a brief
description of the SVM and of the text location method using an SVM.

Consider a two-class pattern classification problem. Let the training set of size N
be $\{(\mathbf{x}_i, d_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in R^p$ is the input pattern for the i-th example and
$d_i \in \{-1, +1\}$ is the corresponding desired response. The non-linear SVM first
performs a nonlinear mapping $\varphi: R^p \to H$. Let $\varphi(\mathbf{x})$ denote a set of non-linear
transformations from the input space $R^p$ to the feature space H. An SVM can be
trained to construct a hyperplane $\mathbf{w}^T \varphi(\mathbf{x}) + b = 0$ for which the margin of separation
is maximized [1]. Using the method of Lagrange multipliers, this hyperplane can be
represented as:

$$\sum_{i=1}^{N} \alpha_i d_i K(\mathbf{x}_i, \mathbf{x}) + b = 0 \qquad (1)$$

where the auxiliary variables $\alpha_i$ are Lagrange multipliers. These $\alpha_i$ can be found
by solving the following problem:

$$\text{Maximize } Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j K(\mathbf{x}_i, \mathbf{x}_j) \qquad (2)$$

where $K(\mathbf{x}_i, \mathbf{x}_j)$ is the inner-product kernel defined by:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \varphi^T(\mathbf{x}_i)\, \varphi(\mathbf{x}_j) \qquad (3)$$

subject to the constraints:

$$\sum_{i=1}^{N} \alpha_i d_i = 0, \qquad 0 \le \alpha_i \le C \quad \text{for } i = 1, 2, \ldots, N \qquad (4)$$

where C is a constant balancing contributions from the first and second terms. It
plays a role similar to that of the regularization parameter in radial basis function
networks and affects the generalization performance of the SVM. The value of this
parameter was determined empirically to be 10.

An SVM is trained to classify each pixel in the input image into text or non-text 
by analyzing the textural properties of its local neighborhood. No texture feature 
extraction scheme is explicitly incorporated. Instead the intensity level values of the 
raw pixels are directly fed to the classifier. The proposed method utilizes a small 
window to scan the image and classify the pixel located in the center of window as 
text or non-text using an SVM. 




[Figure: from bottom to top, texture pattern x (M x M window); support vectors
x_1, ..., x_n; mapped vectors Φ(x_i), Φ(x); inner product (Φ(x_i) · Φ(x)) = K(x_i, x);
weights; output y]

Fig. 3. SVM architecture for text detection



Fig. 3 shows the three-layer feed-forward network architecture of an SVM used as a
text detector. The input to the network comes directly from the intensity level values
of the MxM (typically 13x13) window in the input image. However, instead of using all
the pixels in the window, a configuration for autoregressive features is used [6]. The
hidden layer applies the nonlinear mapping $\varphi$ from the input space to the feature space H
and computes the dot product between its input and the support vectors. These two
operations are performed in a single step by using the kernel function K. A polynomial
kernel is used as the kernel function. The sign of the output y, obtained by
weighting the activations of the hidden layer, represents the class of the central
pixel in the input window. For training, +1 was assigned to the text class and -1 to
the non-text class. Accordingly, if the SVM's output for a pixel is positive, the pixel
is classified as text. To detect text in an image, the detection window is shifted over
all locations in the image. As a result of the classification, a binary image is
obtained in which the text pixels are black.
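
The decision rule just described, a weighted sum of kernel evaluations against the
support vectors whose sign gives the class, can be sketched as below. This is a
minimal illustration, not the authors' implementation: the kernel form
(x . y + 1)^degree, the degree, the support vectors, the multipliers and the bias are
all assumed values chosen only to make the example self-contained.

```python
def polynomial_kernel(x, y, degree=2):
    """Polynomial kernel K(x, y) = (x . y + 1) ** degree (assumed form)."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** degree


def svm_output(x, support_vectors, alphas, labels, bias, degree=2):
    """y = sum_i alpha_i * d_i * K(x_i, x) + b over the support vectors."""
    return sum(alpha * d * polynomial_kernel(sv, x, degree)
               for sv, alpha, d in zip(support_vectors, alphas, labels)) + bias


def is_text(x, support_vectors, alphas, labels, bias, degree=2):
    """A positive output means the centre pixel is classified as text (+1)."""
    return svm_output(x, support_vectors, alphas, labels, bias, degree) > 0


# Toy model with two hypothetical support vectors in a 2-D input space.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
alphas = [0.5, 0.5]
labels = [+1, -1]          # +1 = text, -1 = non-text
bias = 0.0

y_pos = svm_output([1.0, 0.5], support_vectors, alphas, labels, bias)
y_neg = svm_output([-1.0, -0.5], support_vectors, alphas, labels, bias)
```

In the text detector the input vector would hold the window intensities rather than
two toy coordinates.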
3 Experimental Results

The database for this experiment was composed of two sets. One included image data, 
and the other video data. The images were all scanned using a Hewlett Packard 
ScanJet 5100C flat-bed scanner. A total of 100 scanned images were used for the 
experiments. The scanned images came from a variety of journals and magazines. The 
resolution of the images was 150 DPI. The proposed text location method was also
applied to 12 video clips, each with a running time of 2 to 3 minutes. A total of 1000
key frames with a size of 320x240 were automatically selected using a simple key frame
extraction technique. Of these frames, 50 video frames and several images were used
in the initial training process, and the others were used in the testing process. This 
paper focuses on super-imposed and horizontally aligned text in video frames and 
complex images. No prior knowledge of resolution, text location, and font styles is 
assumed. However the horizontal alignment of text is required. 




Fig. 4. Example of text location 

Text rectangles are identified by performing a profile analysis and merging certain
rectangles. Owing to the restriction that text must be aligned horizontally in the
input images, a simple heuristic method is used to align the bounding boxes. Figure 4
shows examples of the text location [4]. The detection rates and processing times for
the neural network and the support vector machine are shown in Table 1, compared with
those of the connected component analysis method [8]. The proposed methods clearly
achieve higher detection rates than the connected component method, at the cost of
longer processing times. The most important texts with sufficiently large font sizes
were successfully located by the proposed system.



Table 1. Performance comparisons

Methods                       Detection rates (%)   Processing times (sec.)
NN                            86.3                  3.5
SVM                           91.2                  7.0
Connected Component Method    73.8                  1.2
4 Conclusions



This chapter presents a summary of the proposed text location methods. In addition,
some continuing problems are noted, which need to be addressed in future work. It
has been shown that a neural network and a support vector machine can be trained for
supervised text location in images. The proposed text location methods can be
distinguished from other algorithms by the following properties: (1) discrimination
filters for several applications can be automatically constructed; (2) the input image
is processed directly instead of extracting feature vectors; (3) they allow for the
easy implementation of a parallel system. The main limitation of the current system is
its running time. It may be useful to adapt the architecture of the filtering network
or the input window size automatically. The relatively low number of false detections
shows the strength of an SVM for text location. In real-time applications, however,
the rather longer processing time of an SVM suggests its use as a second-stage
processor, which only investigates doubtful regions that have already been identified
by a faster yet less reliable first-stage text detector, for example a neural network
or a connected component method.
Abstract. Video partitioning is a key issue in video classification that facilitates the
management of video resources. Video partitioning involves the detection of boundaries
between uninterrupted segments (video shots). Shot boundaries can be classified into two
categories: gradual transitions and abrupt changes. Detection of a gradual transition is
considered to be difficult, and few methods have been reported for it. In this paper, a
new approach called Two Measures Two Thresholds (TMTT) is proposed. The method requires
the use of two measures and consequently two thresholds. By comparing the gray level
histogram difference of consecutive frames with a smaller threshold ($T_s$), possible shot
boundaries are located. False boundaries are then discarded by comparing their color ratio
histogram with another threshold that is used to measure the similarity of content of the
frames. An analysis of some experimental results indicates that TMTT is promising.

1 Introduction 

It has become important to archive and access multimedia information in several
application fields such as VOD (video on demand) and DLI (digital library). Of
all the media types, video is the most challenging one, because it combines all the
other media information into a single bit stream. The most popular retrieval method is
text-based. Unfortunately, the content of videos is so rich that keywords cannot
express all the information. It is therefore urgent to browse and retrieve video
sequences directly by their content.

The primary task before browsing and retrieval is to organize the video, which has an
obvious structural hierarchy, i.e., video, shot, and frame. A video is made up of shots;
a shot is an uninterrupted segment of screen time, space and graphical configurations.
According to their duration, shot boundaries are of two types: camera breaks and
gradual transitions. Basically there are two kinds of gradual transition: wipe and
dissolve. A wipe is a moving boundary line crossing the screen such that one shot
gradually replaces another; a dissolve superimposes two shots, where one shot
gradually lightens while the other fades out slowly. Detection of camera breaks has
received considerable attention and has been done very successfully. Unfortunately,
the detection of gradual transitions is still rather unsatisfactory. Yeo and Liu
suggested looking for the "plateau" [1], and Zhang et al. proposed a twin-comparison
algorithm [2] to search for gradual transitions. A step-variable algorithm was proposed
by Wei Xiong and John Chung-Mong Lee [3]. Lifang Gu suggested a linear-model-based
method to detect dissolves [4]. Model-based video segmentation is suggested in our
paper [5]. In another paper, I proposed an algorithm called STDD [6] to detect shot
boundaries. The algorithm can detect both camera breaks and gradual transitions.
STDD significantly improves the detection performance. However, the choice of
threshold is still difficult, and this affects the performance of the algorithm greatly.
Also, since STDD needs to compare backwards, it is difficult to run in real time.
Recognizing the necessity of combining the merits of different measures, I expressed
the idea of "multi-scale hierarchy video segmentation" in paper [7]. TMTT
can be regarded as an implementation of that idea.

The rest of the paper is organized as follows. In Section 2, TMTT is introduced. 
Color Ratio Histogram and Gray Level Ratio Histogram are reviewed in section 3. 
Experimental results and discussions are reported in Section 4. 

2 Shot Boundary Detection: TMTT 

The algorithm is easy to understand. Fig. 1 illustrates it; the curve stands for the
gray level histogram difference of consecutive frames. The length of a transition
distinguishes a gradual transition from a camera break. We then compare the content of
the frames at the two ends of each possible gradual transition using the gray level
ratio histogram. For a possible camera break such as $b_1$, frames $b_1$ and $b_1 + 1$
are compared:

$$RHD(m,n) = \sum_i \left| RH_m(i) - RH_n(i) \right|$$

where $RH_m$ is the gray level ratio histogram of frame m. RHD is then compared with a
threshold $T_t$. If it is lower than $T_t$, the frames are still inside one shot and
there is no shot change; otherwise there is a shot boundary. False transitions are
mainly caused by illumination variation. A sudden illumination variation may be
mistaken for an abrupt transition, and a gradual illumination variation is easily
regarded as a wipe or dissolve. Both will be marked as boundaries at the first step of
TMTT, but they are discarded at the second step, when the ratio histogram is used. The
thresholds are selected dynamically according to the idea in paper [10].



[Figure: curve of the consecutive gray level histogram difference]

Fig. 1. Four boundaries, including two false boundaries, are found by thresholding the
curve with the threshold $T_s$. $b_1$ and $b_4$ are the possible camera breaks, while
$(b_2, b_3)$ and $(b_5, b_6)$ are the possible gradual transitions.

The algorithm can be formulated as follows:

1. $H = \{h_0, h_1, h_2, h_3, \ldots, h_i, \ldots\}$, where $h_i$ is the gray level histogram of frame i.

2. $HD(m,n) = \sum_{i=0}^{255} \left| h_m(i) - h_n(i) \right|$

3. $D = \{d_i \mid d_i = HD(i, i+1),\ i = 0, 1, 2, \ldots\}$

4. $PB = \{pb_0, pb_1, pb_2, pb_3, \ldots, pb_i, \ldots\}$
   $pb_i = \{k \mid d_k > T_s,\ k \text{ is continuous}\} = \{g_0, g_1, g_2, \ldots, g_j\}$

5. $TB = \{tb_0, tb_1, tb_2, tb_3, \ldots, tb_i, \ldots\}$
   $tb_i = pb_i$ if $rhd(pb_i(g_0), pb_i(g_j + 1)) > T_t$
   $rhd(m,n) = \sum_i \left| RH_m(i) - RH_n(i) \right|$
   $RH_i$ is the gray level ratio histogram of the i-th frame.

6. Camera Breaks = $\{tb_i \mid t = 0\}$
   Gradual Transitions = $\{tb_i \mid t > 0,\ M_0 < t < M_1\}$
   t is the lasting time of the gradual transition.
   $M_0$ is the minimum length for a gradual transition. $M_1$ is the maximal length for
   a gradual transition.

The length of the transition can be used to discard some false gradual transitions.
We denote $M_0$ as the minimum length of a gradual transition, $M_1$ as the maximal
length of a gradual transition, and $l_{gt}$ as the length of a transition. When $l_{gt}$
is less than $M_0$ or larger than $M_1$, the frames do not correspond to a gradual
transition. Thus some kinds of camera motion inside a shot (for example, panning) will
not be flagged as a gradual transition.
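
The two-step procedure can be sketched in code as follows. This is a minimal
illustration under several assumptions rather than the author's implementation: frames
are flat lists of gray levels in 0..255, the lasting time t of a candidate is taken as
the number of frame pairs it spans minus one, and the second measure is passed in as a
callable. The `brightness_normalised_difference` function is a hypothetical stand-in,
since the computation of the gray level ratio histogram is not given here.

```python
def gray_histogram(frame, bins=256):
    """Gray level histogram of a frame given as a flat list of ints in 0..255."""
    hist = [0] * bins
    for value in frame:
        hist[value] += 1
    return hist


def histogram_difference(h_m, h_n):
    """HD(m, n): sum of absolute bin differences."""
    return sum(abs(a - b) for a, b in zip(h_m, h_n))


def tmtt(frames, t_s, t_t, second_measure, m0, m1):
    """Return (camera_breaks, gradual_transitions) as lists of (g0, gj) pairs."""
    hists = [gray_histogram(f) for f in frames]
    d = [histogram_difference(hists[i], hists[i + 1])
         for i in range(len(hists) - 1)]

    # Step 4: group consecutive indices k with d[k] > t_s into candidates.
    runs, start = [], None
    for k, value in enumerate(d):
        if value > t_s:
            if start is None:
                start = k
        elif start is not None:
            runs.append((start, k - 1))
            start = None
    if start is not None:
        runs.append((start, len(d) - 1))

    # Steps 5 and 6: keep candidates whose end frames really differ in
    # content, then separate them by length.
    breaks, graduals = [], []
    for g0, gj in runs:
        if second_measure(frames[g0], frames[gj + 1]) <= t_t:
            continue  # false boundary, e.g. an illumination change
        t = gj - g0
        if t == 0:
            breaks.append((g0, gj))
        elif m0 < t < m1:
            graduals.append((g0, gj))
    return breaks, graduals


def brightness_normalised_difference(frame_a, frame_b):
    # Hypothetical second measure: compare sorted pixels after removing
    # the mean brightness of each frame.
    mean_a = sum(frame_a) / len(frame_a)
    mean_b = sum(frame_b) / len(frame_b)
    return sum(abs((a - mean_a) - (b - mean_b))
               for a, b in zip(sorted(frame_a), sorted(frame_b)))


shot_a, flash = [10, 10, 20, 20], [110, 110, 120, 120]
shot_b, shot_c = [10, 200, 10, 200], [100, 50, 100, 50]
mix_1, mix_2 = [40, 150, 40, 150], [70, 100, 70, 100]
frames = [shot_a, shot_a, flash, shot_a, shot_a,
          shot_b, shot_b, mix_1, mix_2, shot_c, shot_c]

breaks, graduals = tmtt(frames, t_s=3, t_t=50,
                        second_measure=brightness_normalised_difference,
                        m0=1, m1=10)
```

In this toy sequence the flash is located by the first threshold but discarded by the
second measure, while the cut from the first shot to the second and the three-step
transition to the third are kept.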

3 Color Ratio Histogram and Gray Level Ratio Histogram 

An important issue in video segmentation is the selection of a suitable metric to
measure the difference between frames. Many have been proposed, such as pixel- or
block-based temporal image difference [8, 9] or gray level and color histograms.
Histograms have been widely used because they are simple to compute and insensitive to
object motion. However, they are sensitive to illumination variations. The color ratio
histogram was adopted in shot detection [10]. Unfortunately, its computing cost is
rather high. To reduce the computing cost, we use the gray level ratio histogram as the
measurement. The computing procedure is similar to that of the color ratio histogram.

Although illumination may change the gray level histogram greatly, we still use it
first, since illumination variation generally does not occur frequently [7]. It is
therefore not economical to calculate the ratio histogram for every frame pair. The
ratio histogram is used only to measure the content similarity of two frames that are
marked as a possible shot boundary.

A problem still remains that neither the color ratio histogram nor the gray level
histogram can solve. First let us review what makes the content inside a shot
change: illumination, object motion, and camera motion. Illumination and object
motion do not affect the color ratio histogram, but some camera motions do alter it.
It is not easy to distinguish the changes introduced by camera
movements such as panning or zooming from those due to special-effect transitions.

4 Experimental Results and Discussion 

TMTT has been validated by experiments with several video sequences which include 
features related to film producing and editing such as lighting condition variation, 
object and camera motion. 
Fig. 2 shows an example of how TMTT works. Table 1 gives the detected
boundaries. This sequence is taken from a TV news report. There are many
illumination variations caused by the camera flashes of journalists taking
photographs at a ceremony. These illumination variations cause sudden changes in
the gray level histogram. However, no false alarms are raised. Three boundaries
are located correctly: two camera breaks and one gradual transition (a wipe).




Fig. 2. There are three curves in the figure. The top curve stands for the difference
of the average gray level of two consecutive frames, the middle one for that of the
gray level histogram, and the bottom one for that of the gray level ratio histogram.
The second curve is just the one expected from Fig. 1. The curve of the average gray
level of two consecutive frames is drawn to test the assumption that the process of
dissolve is linear [4].



Table 1. The segmented shots of the video. ( 2 camera breaks and 1 wipe ) 




To investigate the tolerance and accuracy of gradual transition detection, we
evaluate the results by recall and precision. Denote $A_i$ as the number of frames
due to action i, $B_i$ as the number of detected frames in class i, and $C_i$ as the
number of correctly detected frames in class i. Then

$$\mathrm{Recall}(i) = C_i / A_i, \qquad \mathrm{Precision}(i) = C_i / B_i$$

where $i \in$ {camera breaks, gradual transitions}.
Table 2. Detection results. D: correct detection; M: missed detection; F: false alarms.

Video sequences      Camera break     Gradual transition
                     D    M    F      D    M    F
1 (6878 frames)      25   0    0      13   0    1
2 (7855 frames)      24   0    3      12   3    2
Recall               1.00             0.89
Precision            0.94             0.89


It is easy to see that TMTT recalls all the camera breaks. Some smooth gradual
transitions are missed, and unfortunately there are some false alarms. Knowledge of
camera motion or of dissolves and wipes will be employed to get rid of them. False
camera breaks are due to sudden movements of the camera. Table 3 gives
examples. First the camera is at a distance, then it moves close to the actors;
the content of the picture therefore changes abruptly, which causes a false camera
break alarm. When the camera moves away again and the content of the picture changes
from the actors to the desert, another false camera break is raised.
False gradual transitions are mainly due to panning. The second line of the table
gives an example. First the camera is focused on the Monkey King, who is drawing
the horse; then the camera gradually becomes focused on the monk on the back of the
horse. The content of the picture changes gradually, so the sequence is marked as a
gradual transition, although there is actually no shot change. A very smooth shot
change is very likely to be missed. In the third line of the table, the transition
from the horse to the bird is missed.



Table 3. The failure of the TMTT

[Table of example frames; legible row labels: false cut (camera motion, zooming);
false gradual transition; wipe]



TMTT is able to detect both camera breaks and gradual transitions precisely with
two measures and two thresholds. The recall rate is improved greatly, and very few
shot boundaries are missed. However, an additional process is necessary to separate
true dissolves and wipes from the false alarms. It is impossible to distinguish wipes
and dissolves from camera motion using a histogram, which cannot reflect the regular
pattern of a wipe or dissolve. One possible solution is suggested in my paper [7].

Further work will focus on how to distinguish dissolves and wipes from camera
motion in other ways. In addition, another metric that can reflect a dissolve or wipe
is to be proposed. The computation of this metric should be efficient.
Abstract. This paper compares several information retrieval (IR) methods
applied to the problem of retrieving specific words from a handwritten
document. The methods compared include variants of the Okapi formula
and Latent Semantic Indexing (LSI); recognition-based retrieval;
and keyword search. One novel aspect of the work presented is that it
uses the output stack of a Hidden Markov Model (HMM) handwriting
recognizer with a 30,000-word lexicon to convert each handwritten word
into a document which is then used for document retrieval. Preliminary
experiments on a database of 1158 words from 75 writers indicate that
the keyword search has superior precision and recall for text queries, and
that ink queries result in minor performance reductions.



1 Introduction 

The value of computerized storage of handwritten documents would be greatly 
enhanced if they could be searched and retrieved in ways analogous to the meth- 
ods used for text documents. If precise transcripts of handwritten documents 
exist, then IR techniques can be applied; however, such transcripts are typically 
too costly to generate by hand, and machine recognition methods for automating 
the process of transcript generation are far from perfect. Thus, such transcripts 
are usually incomplete and/or corrupted by incorrect transcriptions. 

One approach to handling these problems is to rely on the redundancy of the 
target documents to compensate for the noise in transcription [4]; however, this 
may not work if document word redundancy is low, and it does not allow for 
handwritten queries. Another approach uses template matching between query 
ink and document ink [1, 2, 5]; however, this can be very slow if the number of 
documents to be searched is large and the match method is very complex; also, 
it does not allow for text queries. Others have used ink simply to annotate text 
documents for IR [3] but do not handle IR of handwritten documents. 

The approach presented here avoids the weaknesses of previous methods by
using a statistical classifier to convert each ink word of a handwritten document
into a set of scores, one associated with each of the possible text translations
of the ink. This set of scores is termed a "stack". In practice, each ink word in
each document is converted into a stack. This step need only be done once. Each
word of an ink query is likewise converted into a stack, while each word of a text
query is converted into a trivial stack by giving a maximum score to the query
word and a minimum score to all other stack entries (i.e., we assume no error
in entering a text query, though this assumption could easily be relaxed).

In this paper we focus only on single-word queries and documents, since these 
form the basis for advanced retrieval tasks. In future work, we will consider more 
complex queries and documents. 



2 Stack Retrieval Methods 



Let W be the set of all possible words and let $I$ be a given handwritten
occurrence of $w \in W$. We define the stack associated with $I$ as the vector
$S(I) = (S_1(I), S_2(I), \ldots)$, where $S_i(I)$ is the score of $I$ given $w_i$, the i-th word
of W, according to some machine recognition system. In this paper we used an
HMM [7] trained on an unconstrained, writer-independent data set to calculate
$S_i(I)$ as a measure of the HMM's probability of $I$ given $w_i$. In practice, we
threshold $S_i(I)$ to disregard low scores, which results in stacks averaging
approximately 16 non-zero entries. For the rest of the paper, we drop explicit
reference to $I$.

Stack retrieval methods work by defining a measure between the stack from a 
query and the stacks from a database. The measure is used to rank each database 
stack relative to each query stack. Documents are then retrieved in their ranked 
order. The precision and recall of various measures can be compared based on 
the rankings. We now define some retrieval methods.

The keyword measure is analogous to standard keyword searches of text
documents; however, in our case, for a keyword $w_i$, a stack is retrieved only if
its corresponding stack score, $S_i$, is above a threshold. The retrieved stacks are
then ranked by their relative $S_i$ scores. This approach assumes that the query
was entered as text rather than ink.

The recognition measure (and all subsequent measures) assumes that the
query was entered as ink and has been converted into a stack. The stack word
with the highest score is then used as text entry to the keyword method (Sec. 2).
Clearly this method cannot work as well as the keyword measure with the
correct word; however, it will benefit from the fact that even though a query
may be recognized incorrectly, the incorrect word will probably exist in other
stacks of the same word. More importantly, this method allows the user to write
a query; in some circumstances, this may be the preferred method of entry (e.g.,
on PDAs).
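
The keyword and recognition measures can be sketched as follows. This is only an
illustration: stacks are assumed to be dictionaries mapping candidate words to
non-negative scores, and the function names, document ids and threshold are
hypothetical.

```python
def keyword_retrieve(word, stacks, threshold=0.0):
    """Return ids of stacks whose score for `word` exceeds the threshold,
    ranked by that score (highest first)."""
    hits = [(stack.get(word, 0.0), doc_id)
            for doc_id, stack in stacks.items()
            if stack.get(word, 0.0) > threshold]
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in hits]


def recognition_retrieve(query_stack, stacks, threshold=0.0):
    """Use the top-scoring word of an ink query's stack as a keyword."""
    top_word = max(query_stack, key=query_stack.get)
    return keyword_retrieve(top_word, stacks, threshold)


# Toy database of three ink words, each already converted into a stack.
stacks = {
    "d1": {"cat": 0.7, "cot": 0.3},
    "d2": {"cot": 0.6, "cat": 0.4},
    "d3": {"dog": 0.9, "dig": 0.1},
}

text_hits = keyword_retrieve("cat", stacks, threshold=0.2)
ink_hits = recognition_retrieve({"cot": 0.55, "cat": 0.45}, stacks, threshold=0.2)
```

In the toy data the ink query is recognized as "cot", yet both stacks that contain
"cat" are still retrieved because "cot" also appears in them, which is the effect
described above.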

The Okapi measure [9] between a query stack, q, and a document stack, d, 
is given by 

/(l d)f{i, q)g{d, D) 

’ ^ ^ Ci+C2L{d)IA + f{i,d) 

where the inverse document frequency is given by 



g{d,D) 



f N -n{d,D)+Q.h\ 

V n{d,D) + Q.b ) ’ 



(2) 



Here $N$ is the number of documents in the database; $D$ is the set of all document
stacks; $L(d)$ is the length of stack $d$ (i.e., the number of scores above a threshold);
$f(i, d)$ is the term frequency of the $i$-th word in $d$, which we define as the
normalized recognition score of the $i$-th word times $L(d)$; $f(i, q)$ is the query term
frequency; $A = \frac{1}{N} \sum_{d \in D} L(d)$ is the average length of all stacks; $n(d, D)$ is the
number of ink documents associated with $D$ which have the same ground-truth
text label as $d$; and $C_1$ and $C_2$ are tunable parameters. In general, ground-truthed
text labels do not exist, so the stacks must be used to estimate $n(d, D)$.
In our use of the Okapi formula, we have not optimized the free parameters but
have chosen instead to use values that have been successfully used elsewhere [6,
8] ($C_1 = 0.5$ and $C_2 = 1.5$), since Okapi is known to be fairly robust to these
parameters.
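As a rough illustration of Eqs. (1) and (2), the following Python sketch scores one document stack against a query stack. It assumes the usual logarithmic Okapi form for g(d, D), represents each stack as a dictionary from word to term frequency, and takes the estimate of n(d, D) as an input; the function names `okapi_idf` and `okapi_score` are hypothetical and not part of the original system.

```python
import math

def okapi_idf(n_docs, n_match):
    """Inverse document frequency g(d, D), assuming the logarithmic Okapi form."""
    return math.log((n_docs - n_match + 0.5) / (n_match + 0.5))

def okapi_score(query, doc, idf, avg_len, c1=0.5, c2=1.5):
    """Okapi measure between two stacks given as {word: term frequency}.

    idf is g(d, D) for this document stack and avg_len is A, the average
    stack length over the database.
    """
    doc_len = len(doc)  # L(d): number of scores kept in the stack
    total = 0.0
    for word, f_q in query.items():
        f_d = doc.get(word, 0.0)
        if f_d > 0.0:
            # one term of the sum in Eq. (1)
            total += f_d * f_q * idf / (c1 + c2 * doc_len / avg_len + f_d)
    return total
```

Words that appear in only one of the two stacks contribute nothing, so the cost of a comparison grows with the stack length rather than with the lexicon size.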

The correlation measure between a query stack, q, and a document stack, d, 
is given by 



$$C(q, d) = \frac{q \cdot d}{q \cdot q + d \cdot d} \tag{3}$$



which is always between 0 and 0.5. The correlation is used to rank the database 
stacks. 

The cosine measure between a query stack, q, and a document stack, d, is 
given by 



$$\cos(q, d) = \frac{q \cdot d}{\sqrt{(q \cdot q)(d \cdot d)}} \tag{4}$$



which is always between 0 and 1. The cosine is then used to rank the database 
stacks. The q and d can be pre-normalized to reduce computation. This is the 
same measure used in LSI [10]. 
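The correlation and cosine measures can be sketched in a few lines of Python. The sketch assumes each stack is stored sparsely as a dictionary from word to score, with missing words counting as zero; the helper names (`dot`, `correlation`, `cosine`, `rank_stacks`) are illustrative only.

```python
import math

def dot(a, b):
    """Inner product of two sparse stacks stored as {word: score}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter stack
    return sum(score * b.get(word, 0.0) for word, score in a.items())

def correlation(q, d):
    """Eq. (3): q.d / (q.q + d.d)."""
    return dot(q, d) / (dot(q, q) + dot(d, d))

def cosine(q, d):
    """Eq. (4): q.d / sqrt((q.q)(d.d))."""
    return dot(q, d) / math.sqrt(dot(q, q) * dot(d, d))

def rank_stacks(query, database, measure):
    """Return database keys ordered from most to least similar to the query."""
    return sorted(database, key=lambda name: measure(query, database[name]),
                  reverse=True)
```

For identical stacks the correlation reaches its maximum of 0.5 and the cosine its maximum of 1, which matches the ranges stated above.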



3 Experiments 

In order to compare the IR performance of the various methods outlined above, 
a database of 1158 unconstrained, handwritten ink words (580 unique word 
labels) from sentences of 75 writers was converted into stacks using multistate, 
Bakis-topology HMMs [7]. The scores in the stacks correspond loosely to the
log-likelihoods of the data given the words. A beam search algorithm was used to prune
out unlikely words from a lexicon of 30,000 words resulting in stacks ranging in 
length from 2 to 20 words with an average of 16. Each stack was associated with 
a visual ground-truthed “correct” word label. 

All of the experiments reported in this paper were carried out using leave-one-out
cross-validation queries: each word was removed from the database in turn and
used as a query against the remainder. Words which appear only once in the
database (445 words) were not used as queries, since one cannot retrieve
documents that do not exist. Results were then tallied over all queries.

For each query, q, we calculated recall and precision as follows: 

$$\mathrm{Recall}(q, D_q, \theta) = \frac{n_c(q, D_q, \theta)}{n(q, D_q)} \tag{5}$$

$$\mathrm{Precision}(q, D_q, \theta) = \frac{n_c(q, D_q, \theta)}{n_r(q, D_q, \theta)} \tag{6}$$

where $D_q$ is the set of all document stacks without the stack $q$, $n_r(q, D_q, \theta)$ is
the number of stacks retrieved from $D_q$, $n_c(q, D_q, \theta)$ is the number of correct
stacks retrieved from $D_q$, $n(q, D_q)$ is the total number of correct stacks in $D_q$,
and $\theta$ is a threshold used to truncate the document
rankings that are generated for each query. The averages over all queries of the
recall and precision for all $\theta$ are reported in the figures.
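A minimal sketch of the recall and precision computation for a single query is given below. It assumes that the threshold is applied as a cutoff on the number of ranked stacks kept, and that the ground-truth labels of the ranked stacks are available as a list; the function name `recall_precision` is hypothetical.

```python
def recall_precision(ranked_labels, query_label, n_relevant, cutoff):
    """Recall and precision after truncating a ranking at `cutoff` stacks.

    ranked_labels: ground-truth labels of the database stacks, best match first
    n_relevant:    number of stacks in the database sharing the query's label
    """
    retrieved = ranked_labels[:cutoff]
    n_correct = sum(1 for label in retrieved if label == query_label)
    recall = n_correct / n_relevant
    precision = n_correct / len(retrieved) if retrieved else 0.0
    return recall, precision
```

Sweeping `cutoff` over the length of the ranking and averaging over all queries would yield one precision/recall curve per measure.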

4 Results 

Fig. 1A compares the precision vs. recall curves for the keyword, recognition
keyword, Okapi, cosine, correlation factor and LSI measures. Our first observation
is that keyword performs better than recognition keyword: as expected, there
is a significant penalty for making the query more flexible (i.e., including ink
queries). More surprising is the observation that the recognition keyword
performs significantly better than Okapi, correlation factor, cosine and LSI¹, which
are comparable methods to much of current IR research. Since these results are
on a database of 75 different writers, results on data from a single writer
are likely to be better due to increased similarity of the ink.

Fig. 1B shows how the recognition keyword measure can be significantly
improved by knowing whether or not the HMM correctly recognized the ink query.
As one might expect, when the query is known to be correctly recognized by 
the HMM, the precision/recall performance is virtually the same as the keyword 
measure. This suggests that one way to improve ink query retrieval performance 
is to display the recognition results for the query to the user for verification be- 
fore performing retrieval. If the recognition is incorrect, the user could optionally 
correct it. 

In addition to straightforward ranking, we can also use a threshold on the 
stack scores to prune stacks out of the query-based rankings. The rationale 
for doing this is that stacks are less likely to correspond to the correct word 
as scores decrease. Figs. 2A, 2B and 3A show how the precision/recall curve for
the keyword, recognition keyword and Okapi measures improves as the score
cutoff threshold increases. Note that the precision improvement comes at a cost to
recall. The improvement in Okapi doesn’t quite match that of the keyword and
recognition keyword, but in the low recall domain, it is closer. Also note that
excessive pruning severely degrades performance.

Fig. 3B shows a similar plot for the keyword measure using a rank cutoff 
to prune stacks. Rank pruning improves precision and recall more than score 
pruning. This is because the majority of the correct words are in the top two 
stack positions and for nearly all stacks containing the correct word, the correct 
word occurs in one of the top five stack positions. This suggests that this method
can be made even faster at little loss to performance by only creating stacks of
length five. It further suggests that by combining rank and score thresholding,
we may be able to improve precision while minimizing the recall degradation.

¹ Following heuristics in the LSI literature, we chose to use an 800-dimensional LSI
factor subspace. No optimization was performed.

5 Summary 

We have described a new IR method for ink documents based on the idea of 
converting individual words into stack documents and defining a stack measure 
for constructing a ranking for document retrieval. The method compares favor- 
ably to existing stack measures; can be used in both text and ink retrieval mode; 
and is fast because the stack comparison process is fast and the stack generation 
process for the ink database need only be performed once and can therefore be 
performed off-line and in advance. In the future, we will apply these techniques 
to more complex ink queries and ink documents. 
Abstract. An off-line hand-written Chinese character recognizer supporting
a vocabulary of 4,616 Chinese characters, alphanumerics and
punctuation symbols is reported. Trained with a sample for each
character from each of 100 writers and tested on texts of 160,000 characters
written by another 200 writers, it achieves an average recognition rate of
77.2%. Two statistical language models have been investigated in this
study. They upgrade the recognition rate by 8.8% and 12.0%, respectively,
when used as post-processors of the recognizer.



1 Introduction 

Chinese characters are complex patterns of strokes. The bit-map of a character 
image can be segmented into a number of regions each of which consists of either 
purely white or purely black pixels. An unknown character image is recognized 
by matching its regions to those of the templates. The structural information
of an image in terms of the inter-relationship between its regions is represented
statistically. The location and size of a region are stochastic. Even if a pixel is
known to belong to a particular region, the cellular features [1], considered as
a feature vector, observed at the pixel are still stochastic, and different feature
vectors can be observed at different pixels of the same region. A region is
characterized not just by the distribution of feature vectors observed at its pixels, but
also by the stochastic relationship between it and its neighboring regions, as well as by
its location and size. The totality of such stochastic properties of a region defines
a codeword. Hence, a codeword and a region are synonymous. A character as a 
collection of regions corresponds therefore to a codebook. 

2 Contextual Vector Quantization Character Recognizer 

A character image is abstracted into a matrix of cellular feature vectors $O = [o_{i,j}]$
with $o_{i,j}$ observed at pixel $(i,j)$. Each $o_{i,j}$ is modeled as a realization of a
random vector observable in $z_{i,j}$, which is the region where pixel $(i,j)$ is located.
$z_{i,j}$ takes one of the $K$ qualitative values $\{G_1, G_2, \ldots, G_K\}$, each of which is a
region of the character. Each region is characterized by three sets of attributes:
$\{\Pr(z_{i,j} = G_k),\ \Pr(o_{i,j} \mid G_k),\ \Pr(z_{m,n} = G_l \mid z_{i,j} = G_k)\}$, where $m$ and $n$ are integers
ranging from $i-1$ to $i+1$ and from $j-1$ to $j+1$, respectively, indexing the regions of
the immediate neighbors of pixel $(i,j)$. If these attributes of a region are fitted
into the framework of a codeword, then a character is modeled by a codebook.
Matching an unknown to a character becomes quantizing $O$ with the codebook
of the character. $\Pr(z_{m,n} = G_l \mid z_{i,j} = G_k)$ supplies the contextual information,
which leads to the name Contextual Vector Quantization (CVQ).

$O$ is quantized by quantizing each of its pixels individually. Pixel $(i,j)$ is
quantized to $z_{i,j}$ in order to maximize the posterior probability $\Pr(z_{i,j} \mid O)$.
In order to reduce the complexity of the problem, $z_{i,j}$ is chosen to maximize
$\Pr(z_{i,j} \mid o_{i,j}, o_{\eta_{i,j}})$, where $\eta_{i,j}$ is the immediate neighborhood of pixel $(i,j)$. Under
the assumption that feature vectors in the same neighborhood are related to
each other only through the regions they belong to, one then has this posterior
probability proportional to:

$$\sum_{z_{\eta_{i,j}}} \Pr(z_{i,j}, z_{\eta_{i,j}}) \cdot \prod_{(m,n) \in \eta^{+}_{i,j}} \Pr(o_{m,n} \mid z_{m,n}) \tag{1}$$

where the summation is over all admissible values of $z_{\eta_{i,j}}$ defining the region
membership of the pixels in the prescribed neighborhood $\eta_{i,j}$ of pixel $(i,j)$. $\eta^{+}_{i,j}$
is the union of $\eta_{i,j}$ and $\{(i,j)\}$. Even with this simplification, analytical progress
is barred in general, because $\Pr(z_{i,j}, z_{\eta_{i,j}})$ is unavailable in closed form. For
further simplification, it is assumed that the $z_{m,n}$'s, where $(m,n) \in \eta_{i,j}$, are mutually
independent given $z_{i,j}$. So,

$$\Pr(z_{i,j}, z_{\eta_{i,j}}) = \Pr(z_{i,j}) \prod_{(m,n) \in \eta_{i,j}} \Pr(z_{m,n} \mid z_{i,j}) \tag{2}$$

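A CVQ method can be derived as follows. Given a character image with
observed feature vectors $[o_{i,j}]$, assign each $o_{i,j}$ to region $G_k$ if

$$\begin{aligned}
G_k = \arg\max_{z_{i,j}}\ & \Pr(z_{i,j}) \cdot \Pr(o_{i,j} \mid z_{i,j}) \cdot \\
& \prod_{(m,n) \in \eta_{i,j}} \sum_{z_{m,n}} \Pr(z_{m,n} \mid z_{i,j}) \cdot \Pr(o_{m,n} \mid z_{m,n})
\end{aligned} \tag{3}$$

where the term on the second line of Eq. (3) represents the contribution of
contextual information. The argument of the $\arg\max_{z_{i,j}}$ function,

$$\begin{aligned}
& \Pr(z_{i,j}) \cdot \Pr(o_{i,j} \mid z_{i,j}) \cdot \\
& \prod_{(m,n) \in \eta_{i,j}} \sum_{z_{m,n}} \Pr(z_{m,n} \mid z_{i,j}) \cdot \Pr(o_{m,n} \mid z_{m,n})
\end{aligned} \tag{4}$$

is a pseudo-likelihood measurement of quantizing $o_{i,j}$ to region $G_k$.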
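The per-pixel decision rule can be illustrated with a small Python sketch. It assumes that all probabilities have already been estimated and are passed in as plain lists, and, as a simplification not stated in the text, that one transition table is shared by all neighbor positions; the function name `quantize_pixel` is hypothetical.

```python
def quantize_pixel(prior, emis_center, trans, emis_neighbors):
    """Return (best region index, pseudo-likelihoods) for one pixel.

    prior[k]             : Pr(z_ij = G_k)
    emis_center[k]       : Pr(o_ij | G_k)
    trans[k][l]          : Pr(z_mn = G_l | z_ij = G_k), shared by all neighbors
    emis_neighbors[n][l] : Pr(o_mn | G_l) for the n-th neighboring pixel
    """
    n_regions = len(prior)
    scores = []
    for k in range(n_regions):
        score = prior[k] * emis_center[k]
        for emis in emis_neighbors:
            # contextual term: sum over the regions the neighbor may belong to
            score *= sum(trans[k][l] * emis[l] for l in range(n_regions))
        scores.append(score)
    best = max(range(n_regions), key=scores.__getitem__)
    return best, scores
```

With an empty neighbor list the rule reduces to choosing the region with the largest product of prior and local likelihood; the neighbors can overturn that choice when their observations strongly favor another region.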
Upon matching an unknown image to a character template $\omega$ for identification,
regions of the unknown image are matched to regions of $\omega$. That, in turn,
is accomplished by identifying a region of $\omega$ for each pixel of the unknown
image to be quantized to. That avoids segmenting an unknown image into regions
explicitly and then matching them as a random graph as in [2]. This process of
region identification for each pixel considers not just the pixel in question, but 
its neighboring pixels and the most suitable regions they belong to as well. Thus, 
recognizing a character becomes identifying the codebook that yields the mini- 
mum quantization error (measured in terms of the inverse of a pseudo-likelihood 
function) to the unknown image. 

This algorithm has been implemented in an off-line, writer-independent
hand-written character recognizer supporting a vocabulary of 4,616 Chinese
characters, alphanumerics and punctuation symbols [3]. The codebook for each
character is trained with 100 samples written by 100 writers. When tested on 160,000
characters written by another 200 writers, the recognition rate is 77.2%.

3 Post-Processing Language Models 

If the input is a syntactically and semantically sound sequence of characters, its
linguistic information can provide a useful basis for improving the recognition
rate [4]. The second phase of the character recognizer is thus a language model
which endows the recognizer with linguistic (just statistical at present)
knowledge of Chinese. For each character image, the language model chooses the most
suitable one out of the n-best candidates proposed by the image recognizer in
order to arrive at a sequence of characters which is linguistically sound according
to some criteria. Two statistical language models are tested in
this study as post-processors of the image recognizer. They select a candidate
according to its capability to form words with its neighboring images.

3.1 Lexical Analysis of a Lattice of n-best Candidates 

The lexical analytic statistical language model is based on the usage frequency of
each word in a large lexicon. This lexicon must cover most, if not all, of the
Chinese words actively used in modern texts such as journals, newspapers, and
literature. In order to determine the statistics of word-pairs, to enrich the
vocabulary of the lexicon and to improve the estimates of word usage frequencies, a large
Chinese text corpus of over 63 million characters has been acquired. The first
step towards gathering such statistics is to segment text lines into words because,
unlike texts in English, Chinese texts have no explicit word markers.

Maximum matching [5] is one of the most popular structural segmentation 
algorithms for Chinese texts. This method favors long words and is a greedy 
algorithm in nature, hence, sub-optimal. Segmentation may start from either 
end of the line without any difference in segmentation results. In this study, the 
forward direction is adopted. The major advantage of maximum matching is its 
efficiency while its segmentation accuracy can be expected to lie around 95%. 
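Forward maximum matching can be sketched as follows. The sketch assumes the lexicon is a set of strings, caps the word length with a hypothetical `max_word_len` parameter, and accepts a single character as a fallback when no longer word matches; Latin letters stand in for Chinese characters in the example.

```python
def forward_maximum_matching(text, lexicon, max_word_len=4):
    """Segment `text` greedily, taking the longest lexicon word at each position."""
    words = []
    pos = 0
    while pos < len(text):
        # try the longest span first and shrink until something matches
        for length in range(min(max_word_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            # a single character is always accepted as a fallback
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                pos += length
                break
    return words

# The greedy choice of "them" blocks the segmentation "the" + "men".
print(forward_maximum_matching("themen", {"the", "them", "men", "me"}))
```

The example shows why the method is sub-optimal: once the longest word has been taken, the remainder of the line has to be segmented with whatever is left.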
Most Chinese linguists accept the definition of a word as the minimum unit
that is semantically complete and can be put together as building blocks to form
a sentence. However, in Chinese, words can be united to form compound words,
and they, in turn, can combine further to form yet higher-order compound
words. As a matter of fact, compound words are extremely common and they
exist in large numbers. It is impossible to include all compound words in the
lexicon; only those which are frequently used and have closely united
word components are kept. A lexicon, WORDDATA, was acquired from the Institute of
Information Science, Academia Sinica in Taiwan. There are 78,410 word entries 
in this lexicon, each associated with a usage frequency. Due to cultural differ- 
ences of the two societies, there are many words encountered in the text corpus 
but not in the lexicon. The latter must therefore be enriched before it can be 
applied to perform any lexical analysis. The first step towards this end is to 
merge a lexicon constructed in China into this one made in Taiwan, increasing 
the number of word entries to 85,855. This extended lexicon is then applied to 
segment the text corpus into words. In this process, when a word of a single
character is encountered, word usage frequencies are considered to decide
whether the single character should instead be combined with its neighboring characters
to form other words at the expense of the length of neighboring words. In this
word segmentation process, words used in the text corpus but not found in the
lexicon are considered for addition to the latter, which is eventually enriched
to encompass 87,326 words.

The image recognizer supplies the n-best candidates for each character image
scanned. A line of text, as a sequence of m images delimited by a pair of
punctuation symbols, corresponds to an m-by-n lattice of candidates. Starting from the first image
position, the longest word that can be formed with a candidate of the image as
the first character of the word is accepted. This repeats starting from the next
image position lying beyond the last image of the word just formed until the end
of the line is reached.
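The lattice search described above can be sketched in Python under a few assumptions: each candidate is a single character, the lexicon is a set of strings, words are capped at a hypothetical `max_word_len`, and when no multi-character word can be formed the recognizer's top candidate is kept. The fallback rule and the function name are illustrative choices, not taken from the original system.

```python
def lattice_maximum_matching(lattice, lexicon, max_word_len=4):
    """Pick one candidate per image by greedy longest-word formation.

    lattice[t] is the n-best candidate list for image t, best candidate first.
    """
    def find_word(pos, length):
        # depth-first search for a lexicon word spelled by one candidate
        # from each of the images pos .. pos + length - 1
        def extend(prefix, t):
            if t == pos + length:
                return prefix if prefix in lexicon else None
            for cand in lattice[t]:
                found = extend(prefix + cand, t + 1)
                if found is not None:
                    return found
            return None
        return extend("", pos)

    chosen = []
    pos = 0
    while pos < len(lattice):
        word = None
        for length in range(min(max_word_len, len(lattice) - pos), 1, -1):
            word = find_word(pos, length)
            if word is not None:
                break
        if word is None:
            word = lattice[pos][0]  # assumed fallback: recognizer's top choice
        chosen.append(word)
        pos += len(word)
    return "".join(chosen)
```

For a lattice whose top choices spell "cots" but whose second candidate at the second position is "a", a lexicon containing "cat" pulls the output to "cats". The exhaustive search grows as n to the power of the word length, which is one reason a larger n is costly as well as error-prone.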

As n increases, the number of coincidental word formations increases also, 
thus bringing down the recognition rate instead of upgrading it. On the other 
hand, for pages poorly recognized, n must be large enough to include the true 
candidate. A compromise on the optimal choice of n is reached by experimenting
with the effect of n on the recognition rates on another 100 pages earmarked for
language model tuning. Consequently, n is chosen to be 6. The recognition rate 
over the test text of 160,000 characters is upgraded to 86% from 77.2%. 

3.2 A Language Model of Word Class Bigram Statistics 

The limitation of maximum matching word segmentation as a language model 
is its failure to capture the inter-dependence of words in a line of text. The 
use of bigram statistics in a language model is a step towards overcoming this 
shortcoming. Since there are over 80,000 words in the lexicon, the number of 
parameters in such a language model will be astronomical. A common practice 
is to employ the bigram statistics between word-classes instead. If a sequence
of character images $o_1, \ldots, o_T$ is segmented into $o_1^{w_1}, \ldots, o_{l_1}^{w_1}, \ldots, o_1^{w_h}, \ldots, o_{l_h}^{w_h}$,
corresponding to a word sequence $w_1, \ldots, w_h$ whose words, in turn, belong
to word-classes $s_1, \ldots, s_h$ respectively, the soundness of the segmentation
is measured in terms of:

$$L = p(s_0 \mid s_h) \prod_{i=1}^{h} p(s_i \mid s_{i-1})\, p(o_1^{w_i}, \ldots, o_{l_i}^{w_i} \mid s_i) \tag{5}$$

where $l_i$ denotes the number of character images in word $w_i$ and $s_0$ is a word-class
of punctuation symbols appearing before and after the
sequence of character images. $p(s_i \mid s_{i-1})$ and $p(s_0 \mid s_h)$ can be collected from
the segmented text corpus, while the suitability of $o_1^{w_i}, \ldots, o_{l_i}^{w_i}$ forming a word
$w_i$ in $s_i$ is defined as:

$$p(o_1^{w_i}, \ldots, o_{l_i}^{w_i} \mid s_i) = p(w_i \mid s_i) \prod_{j=1}^{l_i} p(o_j^{w_i} \mid c_j^{w_i}) \tag{6}$$

Here, word $w_i$ is a character sequence $c_1^{w_i}, \ldots, c_{l_i}^{w_i}$. $p(w_i \mid s_i)$ is computed from
the segmented text corpus. $p(o_j^{w_i} \mid c_j^{w_i})$ is a measure of similarity between the
observed image $o_j^{w_i}$ and the character $c_j^{w_i}$ of the word $w_i$ supplied by the image
recognizer. The principle of dynamic programming is employed to determine the
optimal segmentation of the character images into words.

Originally, words in WORDDATA are grouped into 192 syntactic/semantic
word-classes, with each word belonging to mostly one but up to four word-classes.
In this investigation, each word is assigned the membership of the most important
class indicated in WORDDATA. A natural and objective criterion in measuring
the soundness of any clustering is that all members within a cluster should
have a similar pattern of associations with all clusters. From the text corpus, the
probability of observing word $w_j$ of $s_i$ placed before any word of class $s_k$ can be
computed for all $k$. Associated with word $w_j$ of $s_i$, there is therefore a probability
vector of 192 components, viz., $p_j^{s_i} = [p_{j,k}^{s_i}]$ for $k = 1, 2, \ldots, 192$. $p_{j,k}^{s_i}$ is the probability
of seeing $w_j$ of class $s_i$ before any word of class $s_k$ in an average line of the
corpus. Since each word belongs to one class only in this investigation, there is
no ambiguity if the superscript $s_i$ is dropped in $p_j^{s_i}$ and its components. These
vectors are normalized so that they lie on the surface of a unit hyper-sphere.

The centroid Ci of class Si is defined as a unit vector along the direction of 
the average probability vector of all the words (weighted by the prior probability 
of the word) of the class. With this concept in mind, the homogeneity of class 
Si, a word-class of Mi words, can be defined as: 

$$H_i = \sum_{j=1}^{M_i} P(w_j)\, C_i \cdot p_j^{s_i} \tag{7}$$

Various thresholds are chosen over a number of iterations so that any word-class
with a homogeneity below the threshold will be split into two as in ISODATA, except that the
feature space is confined to a unit hyper-sphere surface. A newly formed word-class
with a homogeneity still below the threshold will be further split repeatedly.
At the end of an iteration corresponding to a particular homogeneity threshold,
the effect of the word-class bigram statistics language model on the recognition
of the 100 earmarked pages is measured. Finally, 470 word-classes are formed
and bigram statistics between them are collected from the text corpus when the
process converges.
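The centroid and homogeneity of a word-class can be computed as in the sketch below. It assumes the word vectors are already normalized to unit length and that the word priors passed in are normalized within the class (an assumption; the text does not say how the priors are scaled). The function names are hypothetical.

```python
import math

def unit(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def centroid(vectors, priors):
    """Unit vector along the prior-weighted average of the word vectors."""
    dim = len(vectors[0])
    mean = [sum(p * v[k] for v, p in zip(vectors, priors)) for k in range(dim)]
    return unit(mean)

def homogeneity(vectors, priors):
    """Eq. (7): sum over the words of P(w_j) times the inner product C_i . p_j."""
    c = centroid(vectors, priors)
    return sum(p * sum(ck * vk for ck, vk in zip(c, v))
               for v, p in zip(vectors, priors))
```

Under these assumptions a class whose words all share the same association pattern has homogeneity 1, and the value drops as the vectors spread apart, which is what makes a threshold on it usable as a splitting criterion.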

The word-class splitting process discussed above is hierarchical. To mitigate 
the ill effect caused by any mis-classification of words in WORDDATA, after the 
number of classes has stabilized at 470, each word is re-assigned to a word-class 
whose centroid has the maximum inner product with the probability vector of
the word. As soon as a word has been re-assigned, the centroids of the two word- 
classes affected are updated accordingly. With the newly defined word member- 
ships, the probability vector of each word is re-computed by going over the text 
corpus again and so are the homogeneities of all word-classes consequently. This 
process repeats over several iterations. The average recognition rate is upgraded 
to 89.2% after the word-class reassignment process. 

4 Discussion 

An off-line hand-written Chinese character recognizer supporting a vocabulary 
of 4,616 Chinese characters, alphanumerics and punctuation symbols has been 
reported. For the blocks of related news lines with 9 candidates for each image, 
the average recognition rate using the language model as a post processor is 
87.13% compared to 77.2% achieved by the recognizer without language model. 
It shows that the language model is very effective in helping the recognizer to 
select suitable character candidates. 
Abstract. Image retrieval using multiple features often uses explicit
weights that represent the importance of the features in their similarity
metrics. In this paper, a novel retrieval method based on Bayesian
Learning is presented. Instead of giving every feature a weight explicitly,
the importance of a feature is regulated implicitly by learning a user’s
perception. Thus, the process of feature combination is adaptive and
approximate to a user’s perception. Experimental results demonstrate the
significance of this method for improving the retrieval efficiency.



1 Introduction 

Content-based image retrieval aims to find images that are similar to a user’s 
query through extracted image features. In the early work, most research fo- 
cused on retrieval by using a single feature, and the feature used is inadequate 
to describe the content of a general image. Hence, the retrieval performances are 
often unsatisfactory. To remedy the situation, retrieval using multiple features
is now commonly used [4, 7, 9]. Often, weights are used in a linear combination
of features. The weights are used to represent the importance of the individual
features in the computation of a similarity metric. However, there is often not a
linear relationship among these features in the similarity measure. In [5, 1], the
rank of retrieved images is either used instead of similarity scores for combining
features or used to derive weights. However, the rank is also not linearly
proportional to the similarity nor to each feature. In [10], a neural network model for
merging heterogeneous features is presented. This model can be used to
determine nonlinear relationships between features.

When using multiple features, the key problem is how to decide the importance
of individual features. After all, every user has his or her own subjective
perception of the importance of these features. Worse still, the user’s perception of
the image content rests more on high-level semantics than on low-level image features.
Therefore, the combination of multiple features should not be rigid but adaptive
to the user. It is then essential to learn a user’s perception when combining
multiple features for image retrieval. By doing so, the semantics are also implicitly
learned.
In this paper, a novel retrieval method based on Bayesian Learning is
presented, in which Bayesian Learning is adopted to learn a user’s perception. The
combination of different features is regulated by the learned perception from a
relevance feedback process. The probability of positive retrieval is used as the
similarity metric to rank candidate images and then the top $N_{out}$ images are
output as the retrieval result. Experimental results show that this method can improve
the retrieval efficiency significantly.

In section 2, the system model for Bayesian Learning is presented. In section 
3, the algorithm based on Bayesian Learning is introduced. Section 4 presents 
the results. Finally, concluding remarks are given in section 5. 



2 The System Model for Bayesian Learning 




Fig. 1. The model for Bayesian Learning 



The system model for Bayesian Learning is shown in Fig. 1. The block of
Bayesian Learning is the core of the system model. It executes the learning
process and regulates the combination of multiple features. This block receives
different features extracted from a candidate image and outputs the corresponding
similarity scores in terms of probability. The Rank and Output block ranks
candidate images according to the corresponding probabilities and outputs the
top $N_{out}$ images as the retrieval result at each retrieval cycle. The block of
Relevance Feedback [9, 1, 8, 3, 2] is used to feed a user’s perception into the learning
process. By Relevance Feedback, a user can submit images that he/she considers
similar to the query as positive images. The connection between the two
blocks of Relevance Feedback and Bayesian Learning is very important in the
model because the positive images provide the successive data for learning. The
block of Bayesian Learning is refreshed by learning the training data, and the
user’s perception reflected in the positive images is grasped. Then, the feature
combination will be regulated according to the newly learned perception, and
candidate images will be retrieved by a refreshed criterion. As shown in Fig. 1,
the whole process builds up a learning cycle. In a retrieval process, this learning
cycle will continue until the retrieval process satisfactorily ends.
3 The Algorithm Based on Bayesian Learning



Let $F = \{f_i\}$ $(1 \le i \le N)$ be the set of features used in image retrieval, and let
$f_i^q$ and $f_i^c$ be the feature vectors of a query and a candidate image, respectively.
Also, $S$ represents the set of the positive images that are similar to a given query.
$P(S \mid F)$ gives the probability that an image belongs to set $S$ when multiple
features $f_i$ $(1 \le i \le N)$ are considered. This probability can be used as a metric
to rank candidate images.

By defining a hypothesis space $H = \{S, \bar{S}\}$ and $F$ as the observed training
data, $P(S \mid F)$ can be obtained by Bayesian Learning [6, 11, 3, 2]:

$$P(S \mid F) = \frac{P(F \mid S)\, P(S)}{P(F)} \tag{1}$$

To obtain the rank of candidate images, all $P(S \mid F)$ for these images are
calculated and compared. However, both $P(F \mid S)$ and $P(F)$ are variables
here. Yet only $P(F \mid S)$ and $P(F \mid \bar{S})$ require learning, considering that
$P(F) = P(F \mid S) P(S) + P(F \mid \bar{S}) P(\bar{S})$. As for the prior probabilities $P(S)$
and $P(\bar{S})$, it is not necessary to consider them for the reason given later.

Compared to learning $P(F \mid S)$ and $P(F \mid \bar{S})$, it is more convenient to learn
their probability densities $p(F \mid S)$ and $p(F \mid \bar{S})$ in practice. In the learning
algorithm, they are learned from two directions, respectively. Initially, $S$ is a
null set while $\bar{S}$ includes all of the candidate images. When some positive images
are submitted, they are moved from $\bar{S}$ to $S$. Then $p(F \mid S)$ and $p(F \mid \bar{S})$ will be
refreshed accordingly.

Considering that the rank of $P(S \mid F)$ is what a retrieval process is really
concerned with, equation (1) is rewritten as

$$P(S \mid F) = \frac{1}{1 + \left[\dfrac{P(F \mid S)\, P(S)}{P(F \mid \bar{S})\, P(\bar{S})}\right]^{-1}} \tag{2}$$

Because $P(S)$ and $P(\bar{S})$ are dependent on users but independent of candidate
images, the ratio of $P(S)$ and $P(\bar{S})$ will not affect the rank and hence is not
considered for a given query. Then the rank of $P(S \mid F)$ can be derived as

$$\mathrm{Rank}(P(S \mid F)) = \mathrm{Rank}\left(\frac{P(F \mid S)}{P(F \mid \bar{S})}\right) = \mathrm{Rank}\left(\frac{p(F \mid S)}{p(F \mid \bar{S})}\right) \tag{3}$$



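Assume that all of these $f_i$ $(1 \le i \le N)$ are independent of each other, in order
to reduce the computation. Then,

$$\mathrm{Rank}(P(S \mid F)) = \mathrm{Rank}\left(\prod_{i=1}^{N} \frac{p(f_i \mid S)}{p(f_i \mid \bar{S})}\right) \tag{4}$$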



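Since estimating $p(f_i \mid S)$ and $p(f_i \mid \bar{S})$ will bring heavy computation for
large feature dimensions, a function of $f_i^c$, $d_i = T(f_i^c) = \| f_i^c - f_i^q \|$, is used to
replace the corresponding $f_i^c$. Thus, only the distributions of $d_i$ corresponding
to these $f_i^c$ from $S$ and $\bar{S}$ need estimating.

A second assumption is that the distributions of the $d_i$ from $S$ and $\bar{S}$ are all
normal; then equation (4) can be further simplified as

$$\begin{aligned}
\mathrm{Rank}(P(S \mid F)) &= \mathrm{Rank}\left(\prod_{i=1}^{N} \frac{\sigma_{\bar{S},i}}{\sigma_{S,i}} \exp\left\{-\frac{(d_i - \mu_{S,i})^2}{2\sigma_{S,i}^2} + \frac{(d_i - \mu_{\bar{S},i})^2}{2\sigma_{\bar{S},i}^2}\right\}\right) \\
&= \mathrm{Rank}\left(\sum_{i=1}^{N} \left[\left(\frac{d_i - \mu_{\bar{S},i}}{\sigma_{\bar{S},i}}\right)^2 - \left(\frac{d_i - \mu_{S,i}}{\sigma_{S,i}}\right)^2\right]\right) \\
&= \mathrm{Rank}\left(\sum_{i=1}^{N} D(d_i)\right)
\end{aligned} \tag{5}$$

Note that $\mu_{S,i} = \frac{1}{T_S} \sum_{k=1}^{T_S} d_{i,k}$, where $d_{i,k}$ $(1 \le k \le T_S)$ are the distances between the
query and the $T_S$ positive images submitted by then, while $\mu_{\bar{S},i} = \frac{1}{T_{\bar{S}}} \sum_{k=1}^{T_{\bar{S}}} d_{i,k}$,
where $d_{i,k}$ $(1 \le k \le T_{\bar{S}})$ are the distances between the query and the other $T_{\bar{S}}$
images.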

From (5), it can be found that $D(\cdot)$ plays a similar role as a weight in deciding
the importance of multiple features. However, the transform is more adaptive and
has a concrete base because it is derived from the user’s perception. More
importantly, $D(\cdot)$ makes the derived rank identical to the
user’s perception, which is the ultimate goal of a retrieval process.
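The ranking by the summed D values can be sketched as follows. The sketch assumes that the per-feature means and standard deviations are estimated from the distances of the current positive and remaining images, uses the population standard deviation with a small floor to avoid division by zero (both are choices of this sketch, not of the paper), and uses hypothetical function names.

```python
import statistics

def fit(distances, eps=1e-6):
    """Mean and standard deviation of one feature's distances within a set."""
    mu = statistics.mean(distances)
    sigma = max(statistics.pstdev(distances), eps)  # floor is an assumption
    return mu, sigma

def d_score(d, pos_stats, neg_stats):
    """Sum over features of D(d_i); larger means more likely to be positive."""
    total = 0.0
    for d_i, (mu_s, sd_s), (mu_n, sd_n) in zip(d, pos_stats, neg_stats):
        total += ((d_i - mu_n) / sd_n) ** 2 - ((d_i - mu_s) / sd_s) ** 2
    return total

def rank_candidates(candidates, pos_stats, neg_stats):
    """Order candidate names by decreasing summed D(d_i)."""
    return sorted(candidates,
                  key=lambda name: d_score(candidates[name], pos_stats, neg_stats),
                  reverse=True)
```

A feature whose positive distances are tightly clustered has a small standard deviation and therefore penalizes deviations heavily, which is how the implicit weighting arises without an explicit weight per feature.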

Let $R_{f_i}$ be the vector whose elements are the rank values derived from feature
$f_i$, and let $R_F$ be the rank vector from multiple features according to the same
image list. Then the distance between $R_{f_i}$ and $R_F$ can be written as

$$L_{f_i} = \| R_{f_i} - R_F \| \tag{6}$$



Apparently, the smaller the distance $L_{f_i}$ for a feature is, the more powerful the
feature’s ability to control the rank $R_F$ will be. It implies simultaneously that
the feature is more important in a retrieval. Furthermore, a feature’s normalized
ability to control the rank $R_F$ can be written as

$$C(f_i) = \frac{1}{N - 1} \left(1 - \frac{L_{f_i}}{\sum_{j=1}^{N} L_{f_j}}\right) \tag{7}$$
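Equations (6) and (7) can be illustrated with a short sketch. It assumes a Euclidean norm for the rank distance, at least two features, and at least one feature whose ranking differs from the combined ranking (so that the sum of distances is non-zero); the function names are hypothetical.

```python
import math

def rank_distance(r_feature, r_combined):
    """Eq. (6): distance between two rank vectors (Euclidean norm assumed)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r_feature, r_combined)))

def feature_abilities(feature_ranks, combined_rank):
    """Eq. (7): normalized ability C(f_i) of each feature."""
    dists = [rank_distance(r, combined_rank) for r in feature_ranks]
    total = sum(dists)
    n = len(dists)
    return [(1.0 - dist / total) / (n - 1) for dist in dists]
```

The abilities sum to one by construction, so with two features a value above 0.5 marks the feature that dominates the combined ranking.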



If a user’s perception can really be learned here, the function $D(\cdot)$ should
ensure that $C(f_i)$ for a feature $f_i$ on which the user focuses is
higher than that of the others. Simultaneously, for the feature $f_i$, the
corresponding retrieval efficiency should be higher than the others because it
corresponds to the visual content which the user focuses on. Consequently, it
can be deduced that if $E(f_i)$ is defined to represent the retrieval efficiency using
$f_i$, then

$$E(f_i) > E(f_j) \Rightarrow C(f_i) > C(f_j) \quad (1 \le i, j \le N,\ i \ne j) \tag{8}$$




Bayesian Learning for Image Retrieval Using Multiple Features 477 



4 Performance Evaluation 

In the experiment, texture and color features are considered. The texture feature 
is based on Gabor Filtering of an image. For color feature, a color histogram is 
extracted from the RGB color space. The image database used in the experiment 
consists of 1,400 general color images composed from VisTex of MIT and Corel
Stock Photos. The images are classified into 70 classes manually by several hu- 
man observers for the purpose of evaluating experimental results. To illustrate 
the retrieval process clearly, the steps are shown below. 

1. For each feature f_i (1 ≤ i ≤ N), the corresponding distances d_i are 
calculated for all of the candidate images. 

2. The candidate images are ranked according to every type of d_i, forming N 
lists. Based on the number N_out (N_out = 10 is used here), the top 
images are collected from every list and output as the initial retrieval result. 
Duplicated images of a lower rank are ignored. 

3. By relevance feedback, all positive images among the N_out outputs are submitted. 

4. The distributions are refreshed in the subsequent learning, and for every 
image its D_j values are calculated and summed. 

5. A new ranking of all images is performed. Then go to step 3. 
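
Steps 1 and 2 above can be sketched in a few lines of Python. This is only an 
illustration under stated assumptions: the function name initial_retrieval, the 
dictionary layout of the distances, and the round-robin, rank-by-rank way of 
collecting images from the N lists are hypothetical choices that the text does 
not specify. 

```python
def initial_retrieval(distances_per_feature, n_out):
    """Merge the top-ranked images of N per-feature lists into one result.

    distances_per_feature: list of N dicts mapping image id -> distance d_i.
    n_out: number of distinct images to output (assumption: the lists are
           visited round-robin, rank by rank, until n_out images are found).
    """
    # One ranked list per feature, smallest distance first.
    ranked = [sorted(d, key=d.get) for d in distances_per_feature]
    result, seen = [], set()
    longest = max(len(r) for r in ranked)
    for rank in range(longest):
        for lst in ranked:
            if rank < len(lst) and lst[rank] not in seen:
                # An image already collected is ignored when it reappears.
                seen.add(lst[rank])
                result.append(lst[rank])
                if len(result) == n_out:
                    return result
    return result


texture = {"a": 0.1, "b": 0.2, "c": 0.9}
color = {"c": 0.05, "a": 0.3, "b": 0.8}
print(initial_retrieval([texture, color], 3))
```

The relevance feedback loop of steps 3 to 5 would then re-rank all images after 
each round of user feedback; it is not shown because it depends on the 
distributions defined in section 3. 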




Fig. 2. (a) The retrieval efficiency (b) The ability of a feature 



Fig. 2(a) gives the results for retrieval using the combination of texture 
and color features. In this figure, the horizontal axis represents the percentage 
of browsed images over all of the images in the database, and the vertical axis 
represents the percentage of retrieved positive images over all of the positive 
images in the database, which is known as recall in image retrieval. Comparing 
the three curves, it can be seen that the maximum improvements in recall 
are 11.6% for the color feature and 8.5% for the texture feature, respectively. 
Moreover, cross-referencing Fig. 2(a) and Fig. 2(b) shows that the changes in 
efficiency and in the feature’s ability correlate. The results therefore support 
the deduction of equation (8) given in section 3. In summary, the experimental 
results demonstrate the effectiveness of retrieval with multiple features and 
indicate that a user’s perception can be learned by Bayesian learning. 

5 Conclusion 

In this paper, a novel image retrieval approach using multiple features is 
presented. It adopts Bayesian learning to learn a user’s perception through 
relevance feedback and thereby regulates the importance of features without 
using explicit weights. The combination of features in the similarity metric is 
adaptive and approximates the perception of the user. Experimental results show 
that the presented approach is capable of learning a user’s perception and hence 
improves the retrieval efficiency. 
Abstract. Each content-based image retrieval (CBIR) method using color 
features has limitations that restrict it to its own application areas. Our 
analysis indicates that these limitations are mostly due to the adoption of 
first-order or second-order relations among the color features of a given image 
as its index. In this paper, we propose a new CBIR method based on third-order 
color feature relations. The new method is robust in retrieving geometrically 
transformed images and can be applied in various areas. 



1 Introduction 

A content-based image retrieval (CBIR) system retrieves relevant images based on 
their contents rather than their textual descriptions. Various image features such 
as color, texture, sketch, shape and volume have been studied and used as retrieval 
indices [1, 2]. Color features are frequently used as an index in CBIR systems. 
Given a query image, a CBIR system using color features retrieves all the images 
whose color compositions are similar to that of the query image [3, 4]. Such systems, 
however, cannot represent the spatial information of an image, and they are weak at 
retrieving a translated, rotated and/or scaled image. Many efforts have been made 
to index the spatial information of an image. They can be classified into two 
classes according to the order definition and the group invariance theorem [5]. One 
is based on first-order relations among color features and the other on 
second-order relations. The theorem shows that first-order relations among objects 
vary under any transform, whereas second-order ones are invariant to translation 
and rotation, but not to scaling. Methods based on second-order relations can 
therefore retrieve translated and/or rotated images well, but not scaled ones. 
That is, their application areas are limited by the order of the relations utilized 
in the model [6]. We propose a new CBIR method using third-order color feature 
relations, which are invariant to scaling as well as to translation and rotation. 
This extends its application to various fields. 
2 CBIR Using First-Order and Second-Order Color Feature Relations 

A CBIR system retrieves all the images that are similar to a query image according 
to a predefined similarity measure. A CBIR system using color features uses them as 
the index of a given image and compares images through these indices. In similarity 
measurement using first-order color feature relations, an image is divided into 
several sub-regions [7], and color features are then extracted from each sub-region. 
The color feature usually used is the color histogram, obtained by discretizing the 
image colors and counting the number of times each discrete color occurs in the 
image. The similarity of two images is then measured by comparing the histograms 
of corresponding sub-regions. This method cannot retrieve an image whose objects 
are not in the same sub-regions as those of the compared image, which results in 
poor retrieval of translated images. 

To solve this problem, methods using second-order color feature relations have 
been introduced [8, 9]. These methods assume that if two images are similar, the 
color features extracted from them have similar distance relationships under any 
translation and/or rotation of the objects in the images. Therefore, they can 
retrieve similar images by indexing the distance relations. For example, after a 
histogram is obtained from an image, the three histogram bins with the highest 
values are selected, the average x, y positions of the pixels in each bin are 
calculated, and the distances among these positions are computed. Finally, the 
distances are used as an index of the image. These methods are useful for 
retrieving an image in which the objects have been translated and/or rotated. 
They are, however, still unable to retrieve a scaled image, i.e., one that has 
been enlarged or shrunk. 
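
The second-order index described above can be illustrated with a short Python 
sketch. It assumes the image has already been quantized so that every pixel holds 
its histogram bin index; the function names dominant_bin_centroids and 
pairwise_distances are hypothetical and are not taken from [8, 9]. 

```python
import math
from collections import defaultdict


def dominant_bin_centroids(bin_image, top=3):
    """Average (x, y) position of the pixels in the `top` largest bins.

    bin_image: 2-D list where bin_image[y][x] is the bin index of a pixel.
    """
    sums = defaultdict(lambda: [0, 0.0, 0.0])  # bin -> [count, sum_x, sum_y]
    for y, row in enumerate(bin_image):
        for x, b in enumerate(row):
            s = sums[b]
            s[0] += 1
            s[1] += x
            s[2] += y
    # Largest bins first; ties are broken by bin index for determinism.
    largest = sorted(sums, key=lambda b: (-sums[b][0], b))[:top]
    return [(sums[b][1] / sums[b][0], sums[b][2] / sums[b][0]) for b in largest]


def pairwise_distances(points):
    """Sorted distances between every two of three points."""
    p0, p1, p2 = points
    return sorted([math.dist(p0, p1), math.dist(p1, p2), math.dist(p2, p0)])


original = [(0, 0), (3, 0), (0, 4)]
shifted = [(10, 7), (13, 7), (10, 11)]   # translated copy
doubled = [(0, 0), (6, 0), (0, 8)]       # copy scaled by a factor of 2
print(pairwise_distances(original), pairwise_distances(shifted),
      pairwise_distances(doubled))
```

In the usage lines, the translated triangle yields the same distances as the 
original, while the scaled one does not, which is the limitation the text 
attributes to second-order relations. 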

From this observation, previous works based on first-order or second-order color 
feature relations of given images are limited in their applications and are not 
suited to retrieving scaled images, in which objects are enlarged or shrunk. 



3 CBIR Using Third-Order Color Feature Relations 

To solve these problems, the proposed method uses third-order color feature 
relations as its retrieval index. It extracts the three angles formed by the 
average x, y positions of the pixels in the three highest-valued histogram bins of 
a given image. An angle is a third-order relation among three objects and, by the 
group invariance theorem, is invariant to scale as well as to translation and 
rotation of the objects in the image. The idea is as follows. If a color histogram 
H is defined as a vector (h_0, h_1, ..., h_n), where each element h_i represents 
the number of pixels in the i-th bin of the color histogram of a given image, 
three kinds of similarity between two images can be measured using the following 
similarity measurement equation: 



$$D = w_1 D_1 + w_2 D_2 + w_3 D_3 \qquad (1)$$



where 

w_i are the weights of the order relations, which determine the importance of each 
term according to the application area. x̄_i (ȳ_i) are the averages of the x (y)-coordinate 
values of the pixels in the i-th color histogram element (bin). a, b and c are the 
distances between each two vertices, which are determined by the average positions 
of the three largest color histogram bins h_i. α, β and γ are the angles among the 
three vertices described in the previous section. D_1, D_2 and D_3 represent the 
similarities of the first-order, second-order and third-order color feature 
relations in a given image, respectively. 

CBIR using the similarity measurement proposed in Eq. (1) can be used in any 
application area, for example with translated, rotated and/or scaled images, 
because it includes the first-order to third-order relations of the color features 
of a given image. In addition, it performs well because the weight of each order 
term can be adjusted according to the application area. If we set w_1 and w_2 to 
zero, we can apply it to a third-order application area. 
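
The invariance claimed for the third-order relation can be checked numerically. 
The following Python sketch computes the three interior angles of the triangle 
formed by three average bin positions and combines the three terms as a weighted 
sum in the spirit of Eq. (1). The function names and the use of a sum of absolute 
angle differences as the third-order term are assumptions made for illustration; 
the exact definitions of the individual terms are not reproduced here. 

```python
import math


def triangle_angles(p0, p1, p2):
    """Interior angles (radians, sorted) of the triangle p0-p1-p2.

    The three points must be distinct and not collinear.
    """
    a = math.dist(p1, p2)  # side opposite p0
    b = math.dist(p0, p2)  # side opposite p1
    c = math.dist(p0, p1)  # side opposite p2

    def angle(opp, s1, s2):
        # Law of cosines, clamped against rounding error.
        cos_v = (s1 * s1 + s2 * s2 - opp * opp) / (2 * s1 * s2)
        return math.acos(max(-1.0, min(1.0, cos_v)))

    return sorted([angle(a, b, c), angle(b, a, c), angle(c, a, b)])


def angle_distance(tri_a, tri_b):
    """Hypothetical third-order term: sum of absolute angle differences."""
    return sum(abs(u - v) for u, v in zip(triangle_angles(*tri_a),
                                           triangle_angles(*tri_b)))


def combined_distance(d1, d2, d3, w=(0.0, 0.0, 1.0)):
    """Weighted sum of first-, second- and third-order terms, as in Eq. (1)."""
    return w[0] * d1 + w[1] * d2 + w[2] * d3


original = [(0, 0), (3, 0), (0, 4)]
scaled = [(0, 0), (6, 0), (0, 8)]        # scaled by a factor of 2
rotated = [(5, 5), (5, 8), (1, 5)]       # rotated by 90 degrees and shifted
print(angle_distance(original, scaled), angle_distance(original, rotated))
```

Both printed values are zero up to rounding error, so the angle-based term does 
not change under scaling, rotation or translation. The default weights (0, 0, 1) 
correspond to the case in which only the third-order term is used. 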



4 Experiments 

To evaluate the performance of the proposed method, we conducted three 
experiments. The first and second were performed with four types of geometrically 
transformed pottery images, and the last with images from various domains. We 
used a database of 350 color images, 256 × 256 pixels in size and classified into 
26 groups, selected from various application domain sources. All the experiments 
were performed in the HSI color space, which is closer to the human visual system 
than the RGB one. We discretized H (hue) to 6 levels, S (saturation) to 2 levels 
and I (intensity) to 2 levels, and made an index vector of 24 elements (bins). 
Experimental results are shown in Tables 1, 2 and 3. 
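
The 24-bin index vector can be illustrated as follows. The sketch assumes that 
hue is given in degrees in [0, 360), that saturation and intensity lie in [0, 1], 
and that each channel is quantized uniformly; the text does not state the 
quantization boundaries, so these are assumptions, as are the names hsi_bin and 
hsi_histogram. 

```python
H_LEVELS, S_LEVELS, I_LEVELS = 6, 2, 2   # 6 * 2 * 2 = 24 bins


def hsi_bin(h, s, i):
    """Map one HSI pixel to a bin index in 0..23 (uniform quantization)."""
    hq = min(int(h / 360.0 * H_LEVELS), H_LEVELS - 1)
    sq = min(int(s * S_LEVELS), S_LEVELS - 1)
    iq = min(int(i * I_LEVELS), I_LEVELS - 1)
    return (hq * S_LEVELS + sq) * I_LEVELS + iq


def hsi_histogram(pixels):
    """Count the pixels falling into each of the 24 bins."""
    hist = [0] * (H_LEVELS * S_LEVELS * I_LEVELS)
    for h, s, i in pixels:
        hist[hsi_bin(h, s, i)] += 1
    return hist


print(hsi_histogram([(0.0, 0.0, 0.0), (90.0, 0.6, 0.2), (359.9, 1.0, 1.0)]))
```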

In the first experiment, the similarities obtained using the three metrics, D_1, 
D_2 and D_3, are demonstrated in Table 1. It shows that D_1, the first-order 
measure, changes dramatically as the order of the transform applied to the given 
image increases; the order of the translation and rotation transforms is 2 and 
the order of the scaling transform is 3. D_2, the second-order measure, is good for 
the translated and rotated images, which result from second-order transforms, but 
still not good enough for the scaled image, which results from a third-order 
transform. On the other hand, D_3, the third-order measure, is good for the scaled 
image as well as for the other images. 

Table 2 shows the retrieval results using relations of different orders for four 
types of geometrically transformed (raw, translated, rotated and scaled) pottery 
images. ’O’ (’X’) means a success (failure) in retrieving a pottery image (a member 
of the same group as the transformed query image). The five most similar images, 
as measured by Eq. (1), are selected as retrieval results. 

Table 3 shows the retrieval success rate with 26 images (one from each group) 
from the whole image database. The success rate represents the percentage of the 
resulting images belonging to the same group as the query image. 



5 Conclusions 

We have proposed a new CBIR method using third-order color feature relations, 
which is robust to scaling as well as to translation and/or rotation of objects in 
a given image. First, we showed that previous indexing methods use first-order or 
second-order color feature relations and are not adequate for retrieving an image 
that has enlarged or shrunk objects. Then, to solve such geometrical transform 
problems, we defined third-order color feature relations using the three largest 
color histogram components. Based on this, a CBIR method invariant to geometrical 
transforms was proposed. In addition, we proposed an adaptation method that 
performs well and presented comparative results against previous works. The 
experiments show that our method based on third-order relations is very efficient 
at retrieving scaled images as well as translated or rotated ones. The retrieval 
rates of the proposed method are also more stable and higher than those of the 
previous methods based on first-order or second-order relations when retrieving 
images from various domains. We believe that these results are due to the superior 
geometrical transform invariance of third-order relations. 

The proposed method, however, has some problems to be solved. First, background 
colors in an image affect the retrieval result, and it is usually difficult to 
remove the background from a given image. Second, the normalization methods for 
the three metrics D_1, D_2 and D_3 remain open. Currently, the normalization 
depends on the application area and must be chosen case by case. Although the 
proposed method still has some problems, it can be applied effectively to 
multimedia data retrieval systems. 
Table 1. Three similarity measurement examples between an original image and its 
three typical transformed images 

Table 2. Retrieval result with geometrically transformed images 

Table 3. Retrieval success rate with images from various domains 
Abstract. This paper presents an incremental algorithm for classification 
problems that uses hierarchical discriminant analysis for real-time learning 
and testing applications. Virtual labels are automatically formed by 
clustering in the output space. These virtual labels are used to derive 
discriminating features in the input space. This procedure is performed 
recursively in a coarse-to-fine fashion, resulting in a tree; the method is 
called incremental hierarchical discriminating regression (IHDR). Embedded in 
the tree is a hierarchical probability distribution model used to prune 
unlikely cases. A sample-size-dependent negative-log-likelihood (NLL) metric 
is used to deal with large-sample-size cases, small-sample-size cases and 
unbalanced-sample-size cases, measured among different internal nodes of the 
IHDR algorithm. We report experimental results of the proposed algorithm for 
an OCR classification problem and an image orientation classification problem. 



1 Introduction 

In many document processing tasks, such as OCR and image orientation 
classification, rich information in the input image can be preserved by 
treating the image as a high-dimensional input vector, where each pixel 
corresponds to a dimension of the vector. Because of the large number of 
deformation variations in the training samples, building a classifier amounts 
to building an image information database. The classifier must therefore 
address three issues: fast retrieval, accurate performance, and the capability 
to update the representation in the database incrementally without 
re-extracting features from large databases. 

In this paper, we present an incremental way of constructing a decision tree 
that uses discriminant analysis at each internal node. Decision trees organize the 
data in a hierarchy so that retrieval time can be logarithmic. Accurate 
performance requires a good feature extraction method. The appearance 
approach has drawn much attention in machine vision [1]. The features derived 
from linear discriminant analysis (LDA) are meant to distinguish different 
classes well and are thus relatively better for classification, provided that 
the samples contain sufficient information [2]. It is desirable to add an 
incremental learning capability to a classifier, so that the classifier can 
adapt to new training data without re-computing the statistics from the whole 
image database. Another advantage is that an incremental method allows the 
trainer to interleave training and testing, which enables the human trainer to 
train the system on “weak cases”, potentially reducing the database size and 
the cost of the training process. 

Fig. 1. Some sample hand-written images used in the experiment. 

Two problems in digital document analysis are used to test the proposed 
algorithm. The first is the recognition of hand-written digits, as shown in 
Fig. 1. The second is the detection of image orientation among four 
possibilities. We applied our algorithm to these two applications and 
compared it with other major algorithms. 



2 The Method 

Two types of clusters are incrementally updated at each node of the IHDR 
algorithm: y-clusters and x-clusters. The y-clusters are clusters in the output 
space Y and the x-clusters are those in the input space X. There are a maximum 
of q (e.g., q = 10) clusters of each type at each node. The q y-clusters 
determine the virtual class label of each arriving sample (x, y) based on its y 
part. Each x-cluster approximates the sample population in X space for the 
samples that belong to it. It may spawn a child node from the current node if a 
finer approximation is required. At each node, the y part of (x, y) finds the 
nearest y-cluster in Euclidean distance and updates (pulls) the center of that 
y-cluster. This y-cluster indicates which x-cluster the input (x, y) belongs to. 
Then, the x part of (x, y) is used to update the statistics of the x-cluster (the 
mean vector and the covariance matrix). The statistics of every x-cluster are 
used to estimate the probability that the current sample (x, y) belongs to the 
x-cluster, whose probability distribution is modeled as a multidimensional 
Gaussian at this level. In other words, each node models a region of the input 
space X using q Gaussians. Each Gaussian is modeled by several smaller Gaussians 
at the next tree level if the current node is not a leaf node. Each x-cluster in 
a leaf node is linked with the corresponding y-cluster. 
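
The per-node update described above can be sketched in Python as follows. The 
sketch is illustrative only: the class name Cluster, the helper names, and the 
use of plain running averages (a Welford-style update of the mean and 
covariance) are assumptions, and the actual update rule of IHDR in [4] may 
differ. 

```python
class Cluster:
    """Running mean and covariance of the vectors assigned to a cluster."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        # Accumulated outer products of deviations (Welford-style).
        self.m2 = [[0.0] * dim for _ in range(dim)]

    def update(self, v):
        self.n += 1
        delta = [vi - mi for vi, mi in zip(v, self.mean)]
        self.mean = [mi + di / self.n for mi, di in zip(self.mean, delta)]
        delta2 = [vi - mi for vi, mi in zip(v, self.mean)]
        for r, dr in enumerate(delta):
            for c, dc in enumerate(delta2):
                self.m2[r][c] += dr * dc

    def covariance(self):
        # Sample covariance; requires at least two samples.
        return [[e / (self.n - 1) for e in row] for row in self.m2]


def nearest(clusters, v):
    """Index of the cluster whose mean is closest to v (Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c.mean, v))
    return min(range(len(clusters)), key=lambda i: sq_dist(clusters[i]))


def update_node(y_clusters, x_clusters, x, y):
    """The y part selects the cluster pair; both members are then updated."""
    j = nearest(y_clusters, y)
    y_clusters[j].update(y)
    x_clusters[j].update(x)
    return j
```

Because only the count, the mean and the accumulated deviations are stored, a 
new sample can be absorbed without revisiting earlier samples, which is the 
property the incremental method relies on. 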

We define a discriminating subspace as the linear space that passes through 
the centers of these x-clusters. The q centers of the q x-clusters give 
q − 1 discriminating features, which span a (q − 1)-dimensional discriminating 
space. A probability-based distance called size-dependent negative-log-likelihood 
(SNLL) [3] is computed from x to each of the q x-clusters to determine which 
x-cluster should be searched further. If the probability is high enough, the 
sample (x, y) searches the corresponding child (possibly more than one, but 
with an upper bound k) recursively, until the corresponding terminal nodes are 
found. 
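
One way to obtain a basis for such a subspace is to orthonormalize the 
differences between the cluster centers. The Python sketch below uses 
Gram-Schmidt orthogonalization for this; it is an assumption made for 
illustration, with hypothetical function names, and is not necessarily how the 
discriminating features are computed in [4]. 

```python
import math


def discriminating_basis(centers, eps=1e-12):
    """Orthonormal basis of the space spanned by the center differences.

    With q centers in general position the basis has q - 1 vectors.
    """
    base = centers[0]
    basis = []
    for c in centers[1:]:
        v = [ci - bi for ci, bi in zip(c, base)]
        for u in basis:
            # Remove the component along each earlier basis vector.
            dot = sum(vi * ui for vi, ui in zip(v, u))
            v = [vi - dot * ui for vi, ui in zip(v, u)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > eps:  # skip centers that add no new direction
            basis.append([vi / norm for vi in v])
    return basis


def project(x, centers, basis):
    """Coordinates of x in the discriminating subspace."""
    d = [xi - bi for xi, bi in zip(x, centers[0])]
    return [sum(di * ui for di, ui in zip(d, u)) for u in basis]


centers = [[0, 0, 0, 0], [2, 0, 0, 0], [0, 3, 0, 0]]
basis = discriminating_basis(centers)
print(len(basis), project([5, 7, 9, 1], centers, basis))
```

In the usage lines, three centers in a four-dimensional space give two 
features, and the components of the input outside the subspace are discarded. 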

The algorithm incrementally builds a regression tree from a sequence of 
training samples, as shown in Fig. 2. Because of the space limit, we only 
sketch the proposed incremental hierarchical discriminant regression algorithm 
in the following procedures. A more detailed description of the algorithm can 
be found in [4]. 

Procedure 1 Update-node: Given a node N and (x, y), update the node N using 
(x, y) recursively. The parameters include: k, which specifies the upper bound 
on the width of the parallel tree search; δ_x, the sensitivity of the IHDR tree 
in X space, used as a threshold for further exploring a branch; and c, which 
indicates whether a node is on the central search path. Each returned node has 
a flag c. If c = 1, the node is a central cluster, and c = 0 otherwise. 

1. Find the top matched x-cluster in the following way. If c = 0, skip to step 
(2). If y is given, do (a) and (b); otherwise do (b). 

(a) Update the mean of the y-cluster nearest to y in Euclidean distance. 
Incrementally update the mean and the covariance matrix of the x-cluster 
corresponding to that y-cluster. 

(b) Find the x-cluster nearest to x according to the probability-based distances. 
This x-cluster is the central x-cluster. Update the central x-cluster if it 
has not been updated in (a). Mark the central x-cluster as active. 

2. For all the x-clusters of the node N, compute the probability-based distances 
for x to belong to each x-cluster. 

3. Rank the distances in increasing order. 

4. In addition to the central x-cluster, choose peripheral x-clusters in order 
of increasing distance until the distance is larger than δ_x or a total of k 
x-clusters have been chosen. 

5. Return the chosen x-clusters as active clusters. 
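
Steps 3 to 5 of Procedure 1 amount to a bounded selection over ranked 
distances. A minimal Python sketch, assuming the probability-based distances 
have already been computed and using the hypothetical names select_active, 
delta_x and central, is: 

```python
def select_active(distances, k, delta_x, central=None):
    """Return the indices of the x-clusters to explore further.

    distances: probability-based distance from x to each x-cluster.
    central:   index of the central x-cluster, or None when the node is
               not on the central search path (c = 0).
    """
    active = [] if central is None else [central]
    # Visit the clusters in order of increasing distance.
    for idx in sorted(range(len(distances)), key=distances.__getitem__):
        if len(active) >= k or distances[idx] > delta_x:
            break
        if idx != central:
            active.append(idx)
    return active


print(select_active([0.9, 0.2, 0.5, 3.0], k=3, delta_x=1.0, central=1))
```

The bound k limits the width of the parallel search, while the threshold 
prunes clusters that are too unlikely to contain the sample. 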

Procedure 2 Update-tree: Given the root of the tree and a sample (x, y), update 
the tree using (x, y). If y is not given, estimate y and the corresponding 
confidence. The parameters include: k, which specifies the upper bound on the 
width of the parallel tree search. 

1. From the root of the tree, update the node by calling Update-node. 

2. For every active cluster received, check whether it points to a child node. 
If it does, mark it inactive and explore the child node by calling Update-node. 
At most active x-clusters can be returned this way if each node has at most q 
children. 

3. The new central x-cluster is marked as active. 

4. Mark additional active x-clusters according to the smallest probability-based 
distance d, up to k in total if there are that many x-clusters with d < δ_x. 

5. Repeat steps 2 through 4 recursively until all the resulting active 
x-clusters are terminal. 

6. Each leaf node keeps the samples (or sample means) (x_i, y_i) that belong to 
it. If y is not given, the output is y_i if x_i is the nearest neighbor among 
these samples. If y is given, do the following: if ||y − y_i|| is smaller than an 
error tolerance, (x, y) updates (x_i, y_i) only. Otherwise, (x, y) is a new 
sample to keep in the leaf. 

7. If the number of samples exceeds the number required for estimating the 
statistics in a new child, the top-matched x-cluster in the leaf node along the 
central path spawns a child, which has q new x-clusters. 

Fig. 2. An illustration of the proposed IHDR algorithm. Inside each node, the 
discriminating subspace is shown, represented as images. 
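
The leaf-node behaviour in step 6 of Procedure 2 can be sketched as follows. 
The class name Leaf, the choice of the nearest stored x_i as the sample to 
compare against, and the running-average form of the update are assumptions 
made for illustration; the text only states that (x, y) "updates" the stored 
sample. 

```python
import math


class Leaf:
    def __init__(self, tolerance):
        self.samples = []          # each entry is [x_i, y_i, count]
        self.tolerance = tolerance

    def _nearest(self, x):
        return min(self.samples, key=lambda s: math.dist(s[0], x))

    def predict(self, x):
        """y is not given: output the y_i of the nearest stored x_i."""
        return self._nearest(x)[1]

    def learn(self, x, y):
        """y is given: update the nearest sample or keep a new one."""
        if self.samples:
            s = self._nearest(x)
            if math.dist(s[1], y) < self.tolerance:
                # Assumption: the update is a running average of x_i and y_i.
                s[2] += 1
                s[0] = [a + (b - a) / s[2] for a, b in zip(s[0], x)]
                s[1] = [a + (b - a) / s[2] for a, b in zip(s[1], y)]
                return
        self.samples.append([list(x), list(y), 1])


leaf = Leaf(tolerance=0.5)
leaf.learn([0, 0], [1])
leaf.learn([10, 10], [5])
leaf.learn([2, 2], [1.2])          # merged with the first sample
print(len(leaf.samples), leaf.predict([9, 9]))
```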

3 The Experimental Results 

We report two types of experiments for the proposed IHDR algorithm. In each 
experiment, the incrementally updated means in the input space are used as the 
output labels. 

3.1 Recognition of hand-written digits 

The data set used for this test is the MNIST database of hand-written 
digits from AT&T Labs-Research [5]. It has a training set of 60,000 examples 
and a test set of 10,000 examples. The digits have been size-normalized and 
centered in a fixed-size image. 

Many methods have been tested with this training set and test set. We compare 
our method with others in Table 1. Since we tested only the original data 
set, the comparison is restricted to methods that use the original data set¹. 
The IHDR method achieved the second-best accuracy, behind only the SVM, which 
is a batch method that requires significant time for training and testing. 
Table 1 also compares the feasibility of real-time training and testing among 
the different methods. The IHDR method can be trained and tested in real time 
(both take about 200 ms per input), while the other methods cannot do both. 
Fig. 2 shows the discriminating subspaces of some top nodes of the 
automatically generated IHDR tree. 

Table 1. Performance for MNIST data 

Method                           Test error rate (%)   Real time training   Real time testing 
linear classifier (1-layer NN)   12.0                  No                   Yes 
pairwise linear classifier       7.6                   No                   Yes 
K-nearest-neighbors, Euclidean                         Yes                  No 
40 PCA + quadratic classifier    3.3                   No                   No 
1000 RBF + linear classifier     3.6                   No                   No 
SVM deg 4 polynomial             1.1                   No                   No 
2-layer NN, 300 hidden units     4.7                   No                   Yes 
2-layer NN, 1000 hidden units    4.5                   No                   Yes 
IHDR                             2.66                  Yes                  Yes 

Fig. 3. Images with different orientations 



3.2 Detection of image orientation 

An image may in principle be oriented at any angle. However, the edges of an 
acquired image are typically parallel to the scanner, so the orientation of the 
image is likely to be one of four angles: 0°, 90°, 180° or 270°. The orientation 
detection problem is therefore defined as a 4-class classification problem: 
given a scanned photograph, determine its correct orientation from among the 
four possible ones. Some sample images are shown in Fig. 3. 

Based on the work of Vailaya and Jain [6], we process images to compute 
the local regional moments of the images in the LUV color space. An image is 
split into 10 × 10 blocks. The 3 means and 3 variances in the 3D color space of 
each block form a 600-dimensional raw input space. To weight the input 
components equally, we further normalized each input component to range from 0 
to 1. We tested our algorithm on a database of 16,344 images, consisting of 
7,980 training samples (1,995 samples per class) and 8,364 test samples (2,091 
samples per class). 
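
The construction of the 600-dimensional input can be illustrated with a short 
Python sketch. It assumes the image has already been converted to the LUV color 
space and is at least 10 pixels wide and high; the conversion itself, the 
function names and the use of the population variance are assumptions not taken 
from [6]. 

```python
def block_moments(image, grid=10):
    """Per-block channel means and variances of a 3-channel image.

    image: 2-D list of (c1, c2, c3) tuples, assumed to be in LUV already.
    Returns grid * grid * 6 values (600 for the default grid).
    """
    h, w = len(image), len(image[0])
    features = []
    for by in range(grid):
        for bx in range(grid):
            rows = range(by * h // grid, (by + 1) * h // grid)
            cols = range(bx * w // grid, (bx + 1) * w // grid)
            pixels = [image[y][x] for y in rows for x in cols]
            n = len(pixels)
            means = [sum(p[c] for p in pixels) / n for c in range(3)]
            variances = [sum((p[c] - means[c]) ** 2 for p in pixels) / n
                         for c in range(3)]
            features.extend(means + variances)
    return features


def min_max_normalize(vectors):
    """Rescale every component to [0, 1] across a set of feature vectors."""
    lo = [min(col) for col in zip(*vectors)]
    hi = [max(col) for col in zip(*vectors)]
    return [[(v - l) / (u - l) if u > l else 0.0
             for v, l, u in zip(vec, lo, hi)] for vec in vectors]


image = [[(float(x), float(y), 1.0) for x in range(20)] for y in range(20)]
print(len(block_moments(image)))
```

The count follows from 10 × 10 blocks with 3 means and 3 variances each, that 
is, 100 × 6 = 600 components. 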

Table 2 compares the performance of two different classifiers. Our 
method gave a slightly better error rate than the results reported in [6]. 

Table 2. The error rates for image orientation detection 

Method   Test error rate (%) 
LVQ      12.8 
IHDR     9.7 

4 Conclusions 

We proposed an incremental hierarchical discriminant regression method which 
clusters in both the output and input spaces. To deal with a high-dimensional 
input space in which some components are not very useful and some can be very 
noisy, a discriminating subspace is incrementally derived at each internal node 
of the tree. Our experimental studies with hand-written digit recognition and 
image orientation detection have shown that the method can achieve reasonably 
good performance compared with other classifiers. Incremental building of the 
tree opens up the possibility of real-time interactive training where the 
number of training samples is too large to be stored or processed in a batch, 
since the reported IHDR does not need to store all the training samples. 

¹ In [5], the authors also report methods using deskewed samples and samples 
augmented with artificially distorted versions of the original training data. 
Abstract 

Precise determination of object planes in images is very important in applications of computer 
vision, such as pattern recognition and 3D reconstruction. The corners of a polygonal object 
plane, e.g. a roof or a wall, can be determined in an image by detecting and intersecting the 
straight edge lines bounding the plane. Any two non-parallel lines in an image intersect at an 
image point. If this intersection corresponds to a 3D point in the scene, it is called a real 
intersection; otherwise it is called a virtual intersection. An automatic system for locating image 
lines is likely to produce many virtual intersections. This paper presents a computational 
technique to discriminate between real and virtual intersections. The method is based on 
rectified images obtained from a pair of uncalibrated images and is illustrated with images of a 
real scene. The results obtained show that the decisions are reliable. 

Keywords: intersections, image rectification, stereo matching, pattern recognition. 



1. Introduction 

Despite many studies in the field of boundary recognition, the question of whether the 
intersection of two lines in an image of a 3D scene corresponds to a real object point 
still merits further investigation. The technique presented here has five phases: 1) a pair 
of images of the scene is transformed into a pseudo stereo pair using the DRUI 
algorithm [1]; 2) edges are obtained using a Canny edge detector [2]; 3) straight lines 
are extracted from the rectified images using the Hough Transform [3] and their 
geometrical and photometrical properties are determined; 4) the lines in left and right 
rectified images are matched, and their intersections determined; 5) the decision on 
whether each intersection is real or virtual is taken. A flowchart for this vision system 
is shown in Figure 1. Each phase is discussed briefly in turn in the following sections. 






Figure 1. Flowchart of the vision system. The rectified left and right images each pass through 
edge detection and line extraction (including line descriptions); the extracted intersecting lines 
are then matched in both images, giving a list of real intersection points and a list of virtual 
intersection points. 
2. Phase 1: Image Rectification 

An important stage in the 3D interpretation of two images of a scene, captured with 
lateral displacement between the view-points, is the production of a pair of rectified 
images; the depth of a point in the scene may then be related to horizontal disparity. 
Existing methods of rectification either require some knowledge of the camera 
calibration [4, 5] or involve some decision-making to determine the optimal 
transformation [6, 7]. The DRUI algorithm provides a general and unambiguous 
method for the rectification of a stereo pair of images from an uncalibrated camera. 
The method has been applied successfully to a wide variety of images. An example is 
shown in Figure 2, where two 1524 by 1012 pixel images of a set of blocks have been 
captured with a digital camera. The fundamental matrix was determined using the 
eight-point algorithm [8] with 16 pairs of manually selected points. The epipolar lines 
corresponding to the selected points are superimposed on the images. The average 
root-mean-square perpendicular distance between point and corresponding epipolar 
line is 0.19 pixels. The details of the rectification process are described in reference 
[1]. The rectified images are shown in Figure 3. 
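
The quality measure quoted above, the perpendicular distance between a point 
and its epipolar line, can be computed as in the following Python sketch. The 
function names are hypothetical, the distance is measured in the second image 
only, and the example fundamental matrix is that of an ideal rectified pair 
rather than the one estimated in the experiment. 

```python
import math


def epipolar_line(F, point):
    """Line l' = F x in the second image for a pixel in the first image."""
    x = (point[0], point[1], 1.0)
    return [sum(F[r][c] * x[c] for c in range(3)) for r in range(3)]


def point_line_distance(line, point):
    """Perpendicular distance from a pixel to the line a*u + b*v + c = 0."""
    a, b, c = line
    return abs(a * point[0] + b * point[1] + c) / math.hypot(a, b)


def rms_epipolar_distance(F, pairs):
    """RMS distance of second-image points to their epipolar lines."""
    sq = [point_line_distance(epipolar_line(F, p), q) ** 2 for p, q in pairs]
    return math.sqrt(sum(sq) / len(sq))


# Fundamental matrix of an ideal rectified pair: epipolar lines are the
# image rows, so matching points share the same vertical coordinate.
F_rectified = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]
pairs = [((10, 20), (15, 20)), ((10, 20), (15, 23))]
print(rms_epipolar_distance(F_rectified, pairs))
```

For a rectified pair the distance reduces to the vertical offset between the 
matched points, which is why rectification relates depth to horizontal 
disparity alone. 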




Figure 2. Images of left and right views of blocks with epipolar lines. 




Figure 3. Images of left and right views of blocks after rectification. 



Figure 4. Results of Canny edge detection with σ = 1 in the Gaussian filtering 
kernel and with higher and lower thresholds of 40 and 20 respectively. 
3. Phase 2: Edge Detection and Thinning 

It is important to use a reliable edge detection and thinning procedure, because noise 
tends to cause false lines. The Roberts, Prewitt, Sobel and Laplacian filters [9] together 
with the Canny edge detector [2] have been evaluated using the rectified images shown 
in Figure 3. Visual inspection of the results showed that the Canny method was the 
most reliable with regard to connectivity and number of extracted edges. Consequently 
this approach was adopted in our system for edge detection and thinning. 

The performance of the Canny edge finder was investigated over a range of Gaussian 
smoothing functions and threshold parameters. The best results were obtained with σ = 
1 and with higher and lower thresholds [2] of 40 and 20 respectively. The resulting 
edge images are shown in Figure 4. 

4. Phase 3: Edge Line Extraction 

Many approaches have been developed to extract straight lines and determine the 
attributes associated with them. The most widely used is the Hough Transform (HT). 
The HT is used to locate collinear edge pixels, and has been shown to perform well 
even with noisy images [3]. In this section, extracting straight lines from images using 
the HT is briefly described. 

The HT parameter space has angle θ ranging from 0° to 180° in steps of 1° and 
parameter ρ ranging from -D to +D in steps of 1 pixel, where D is the image diagonal. 
The size of the template used to define a local maximum in HT space affects the 
number of lines detected. If the size is too large, lines may be missed, but if it is too 
small then spurious lines may be generated. This is illustrated in Figure 5. A 
compromise size 3° x 5 pixel was chosen for the template. After creating a list of local 
maxima in HT space (i.e. a list of lines in the image), edge points are assigned to 
members of this list as follows. Each edge point is transformed into a line in HT space. 
If this line passes through a region of size 2° x 26 pixels centred on a local maximum, 
then the edge point is assigned to the image line corresponding to that maximum. 
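
The voting stage of the HT with the parameter ranges given above can be 
sketched in Python as follows. The sketch uses the normal form 
ρ = x cos θ + y sin θ and samples θ at 0°, 1°, ..., 179°, since θ = 180° 
describes the same lines as θ = 0° with the sign of ρ reversed; the function 
names are hypothetical, and the template-based detection of local maxima is 
replaced here by a simple global maximum. 

```python
import math


def hough_accumulator(edge_points, width, height):
    """Vote in (theta, rho) space for a list of (x, y) edge pixels."""
    diag = int(math.ceil(math.hypot(width, height)))
    # acc[theta][rho + diag] counts the edge points voting for that line.
    acc = [[0] * (2 * diag + 1) for _ in range(180)]
    for x, y in edge_points:
        for theta in range(180):
            t = math.radians(theta)
            rho = int(round(x * math.cos(t) + y * math.sin(t)))
            acc[theta][rho + diag] += 1
    return acc, diag


def strongest_line(acc, diag):
    """(theta in degrees, rho in pixels) of the cell with the most votes."""
    votes, theta, col = max((v, t, c) for t, row in enumerate(acc)
                            for c, v in enumerate(row))
    return theta, col - diag


points = [(x, 5) for x in range(100)]      # horizontal edge at y = 5
acc, diag = hough_accumulator(points, 100, 20)
print(strongest_line(acc, diag))
```

Each edge point traces a sinusoid in the parameter space, and collinear points 
meet in one cell, so the horizontal edge in the usage lines produces a peak at 
θ = 90° and ρ = 5. 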




Figure 5. A simple example of redundant 
lines in HT space. 

Once all edge points are assigned to lines, further checks are carried out. The edge 
points within each list are ordered and checked for line breaks. If two neighbouring 
pixels are found to be separated by a gap of more than 4 pixels, then the list is split into 
two, each containing only the pixels on one side of the gap. Any two lists of points 
with similar ρ and θ values are further tested. If they are found to contain two end 
pixels separated by a gap of less than 5 pixels, then the two lists are merged as 
illustrated in Figure 6. The main problem with this procedure is the difficulty in linking 
lines that happen to have large gaps because of occlusion. 
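
The splitting and merging tests can be illustrated with a short sketch. The helper names are hypothetical, Euclidean distance is assumed as the gap measure, and the preliminary check for similar ρ and θ values is left out.

```python
import math

def split_at_gaps(points, max_gap=4.0):
    """Split an ordered list of (x, y) edge points wherever two neighbours
    are more than max_gap pixels apart."""
    if not points:
        return []
    segments = [[points[0]]]
    for prev, cur in zip(points, points[1:]):
        if math.dist(prev, cur) > max_gap:
            segments.append([])        # start a new list on the far side of the gap
        segments[-1].append(cur)
    return segments

def can_merge(seg_a, seg_b, max_gap=5.0):
    """True if some pair of end pixels, one from each list, is separated
    by less than max_gap pixels."""
    ends_a = (seg_a[0], seg_a[-1])
    ends_b = (seg_b[0], seg_b[-1])
    return any(math.dist(p, q) < max_gap for p in ends_a for q in ends_b)
```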

The number of redundant line segments is further reduced by checking for overlapping 
lines and discarding the shorter ones if the absolute difference between their angles is 
less than a threshold value, as shown in Figure 5. Finally a line is fitted, using the 
least-squares criterion, to each list of edge points. If fewer than 70% of the edge points are 
within a distance of 2 pixels from the fitted line then the line is discarded. 
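
A minimal sketch of the fit-and-verify step is given below. The text does not say which least-squares variant is used; orthogonal (total) least squares via the SVD is assumed here because it handles near-vertical lines, and the function names are illustrative only.

```python
import numpy as np

def fit_line(points):
    """Orthogonal least-squares fit: returns a point on the line (the
    centroid) and a unit direction vector."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The first right singular vector is the direction of largest spread.
    _, _, vt = np.linalg.svd(pts - centroid)
    return centroid, vt[0]

def accept_line(points, max_dist=2.0, min_fraction=0.7):
    """Keep the line only if at least min_fraction of the points lie
    within max_dist pixels of it."""
    pts = np.asarray(points, dtype=float)
    centroid, direction = fit_line(pts)
    normal = np.array([-direction[1], direction[0]])
    distances = np.abs((pts - centroid) @ normal)
    return bool(np.mean(distances <= max_dist) >= min_fraction)
```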

A line segment is defined geometrically by its end-points, from which its length, slope 
and intercept can be calculated. The extreme points of an extracted line do not 
necessarily define accurate end-points for a straight-line segment because the extracted 
pixels are spread around the fitted line. The estimated endpoints are therefore forced to 
lie on the fitted line as shown in Figure 7. The final results for straight edge lines 
extracted from the left and right rectified images in Figure 3 are shown in Figure 8. 
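
One way to force the end-points onto the fitted line is to project every extracted pixel onto the line and keep the two extreme projections. The sketch below assumes this perpendicular projection, which the text does not state explicitly; the function and argument names are hypothetical.

```python
import numpy as np

def endpoints_on_line(points, point_on_line, direction):
    """Project every point onto the line and return the two extreme
    projections, which lie exactly on the fitted line."""
    pts = np.asarray(points, dtype=float)
    p0 = np.asarray(point_on_line, dtype=float)
    u = np.asarray(direction, dtype=float)
    u = u / np.linalg.norm(u)
    t = (pts - p0) @ u          # signed position of each point along the line
    return p0 + t.min() * u, p0 + t.max() * u
```

The segment length then follows directly as the distance between the two returned points.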







Figure 7. Line fitting and end-point 
determination. 




Having extracted lines from the images, additional photometric attributes of the lines 
are obtained as a further aid to matching in phase 4. Four photometric measures are 
used, namely the average intensity in a strip of width 15 pixels along each side of 
the line (two measures), the average intensity gradient across the line, and the sign 
of the intensity gradient. 

Figure 8. Straight lines extracted from rectified left (shown at top) and 
right (shown at bottom) images. 



5. Phase 4: Matching the Straight Line Segments 

Existing approaches to matching lines in pairs of images are of two types: those that 
match individual line segments [10, 11], and those that match groups of line segments 
[12]. Both approaches generally match lines on the basis of their geometrical attributes 
(orientation, length, etc.) but more geometrical information is available in the latter 
case. The authors’ system falls into the latter category since it first identifies pairs of 
intersecting lines in left and right rectified images and then matches them. 

Intersection points within each image are determined by considering each line pair in 
turn. First a check is made that the lines are not almost parallel and, if not, the 
intersection point is then found. If the intersection lies more than 4 pixels beyond the 
end point of either line then it is discarded. The features used in the matching process 
are 1) the angle between the pair of intersecting lines; 2) the orientation of the line pair; 
3) the ratio of lengths of the two lines (shorter : longer); 4) the four photometric 
measures specified at the end of the previous section. 
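
The intersection test can be sketched as below. The 4-pixel overshoot limit comes from the text; the parallelism threshold `min_angle_deg` and the function name are assumptions, since the text only says that the lines must not be almost parallel.

```python
import math

def intersect(seg1, seg2, min_angle_deg=5.0, max_overshoot=4.0):
    """Intersection of the infinite lines through two segments, or None if
    the lines are nearly parallel or the point lies too far past an end."""
    (x1, y1), (x2, y2) = seg1
    (x3, y3), (x4, y4) = seg2
    d1 = (x2 - x1, y2 - y1)
    d2 = (x4 - x3, y4 - y3)
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    len1, len2 = math.hypot(*d1), math.hypot(*d2)
    # |cross| / (len1 * len2) is the sine of the angle between the lines.
    if abs(cross) <= math.sin(math.radians(min_angle_deg)) * len1 * len2:
        return None
    t = ((x3 - x1) * d2[1] - (y3 - y1) * d2[0]) / cross
    s = ((x3 - x1) * d1[1] - (y3 - y1) * d1[0]) / cross
    # t and s lie in [0, 1] inside the segments; convert overshoot to pixels.
    for param, length in ((t, len1), (s, len2)):
        if max(-param, param - 1.0, 0.0) * length > max_overshoot:
            return None
    return (x1 + t * d1[0], y1 + t * d1[1])
```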

An intersection formed by a pair of lines in the left image is taken to match a similar 
intersection in the right image if these features are the same within defined limits and 
the following constraints are satisfied. Spatial relationships between matching features 
are generally maintained and the y co-ordinates of corresponding endpoints must agree 
within a specified error. The results of this matching procedure, for lines in the images 
shown in Figure 4, are presented in Table 1. From a collection of 43 intersections in the 
left image and 46 in the right image, 26 correspondences were matched. Visual 
inspection verified that all matches were correct. 



     Left image (intersecting lines)     Right image (corresponding lines)   Status
No.  1st line  2nd line       X      Y   1st line  2nd line       X      Y   Assigned   True
  1         1        56    1322    419         59        60    1320    392          V     V
  2         3        45     511    472         40        44       ?    477          V     V
  3         4        33     515    307          1        32       ?      ?          R     V*
  4         5        29     497    259          2        28       ?    261          V     V
  5         6        30       ?    410          3        27       ?    409          R     R
  6         6        41     466    278          3        38       ?    278          R     R
  7         8        46     753    157          5        49       ?    178          V     V
  8         ?        24     309    380          8        20       ?    379          R     R
  9         ?         ?       ?    382          8        37       ?    381          R     R
 10        11         ?       ?    479          9        52       ?    463          V     V
 11        14        40     202    391         10        37       ?    391          R     R
 12        14        44     216    479         10        41       ?    478          R     R
 13        15        46     749    223         12        49       ?    217          V     V
 14        19        48     778    630         14        46       ?      ?          V     V
 15        19        50     916    744         14        52       ?    765          V     V
 16        20        32     244    172         13        35       ?    172          R     R
 17        22        28     127    164         18        25     166    165          R     R
 18        24        40     311    382         20        37       ?    381          R     R
 19        25        32     246    161         24        35       ?    161          R     R
 20        25        39     397    148         24        43       ?    147          R     R
 21        28        32     257    100         25        35     344    100          R     R
 22        30        38     362    367         27        36       ?    366          R     R
 23        31        54    1173    298         33        58       ?    301          V     V
 24        35        45     279    502         31        44       ?    504          V     V
 25        48        51     774    891         46        53     707    890          R     V*
 26        53        56    1114    694         57        59    1114    652          V     V

* Incorrect decision. 

Table 1: The status of the intersections of the matched intersecting lines in the left and 
right images (R: Real; V: Virtual; ?: entry not legible). 



6. Phase 5: Distinguishing Real and Virtual Intersections 

A real intersection point should have the same vertical co-ordinate in the left and right 
hand rectified images. The list of intersections produced in phase 4 is checked by this 
criterion. Any cases showing a significant vertical discrepancy are labelled as virtual 
intersections. Of the remainder, there is a difficulty when one of the intersecting lines 
is nearly horizontal in the rectified image, since it is then impossible to determine 
whether the intersection is real or virtual. The remaining cases are taken to be real 
intersections, and the horizontal disparity may be used as an indicator of distance from 
the camera base line. The intersections are listed in Table 1, with indication of the true 
status of each, as deduced from visual analysis. There are just two cases where the 
assigned status is incorrect, both of which involve a nearly horizontal line. 
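
The vertical-discrepancy test can be written in a few lines. The tolerance `max_dy` of 1 pixel is an assumption: the text gives no value, although this choice reproduces the assigned status of the Table 1 rows used below.

```python
def classify(y_left, y_right, max_dy=1.0):
    """Label a matched intersection 'V' (virtual) if its vertical
    co-ordinates disagree by more than max_dy pixels, else 'R' (real)."""
    return "V" if abs(y_left - y_right) > max_dy else "R"

# (No., Y in left image, Y in right image) for four rows of Table 1.
rows = [(1, 419, 392), (4, 259, 261), (6, 278, 278), (17, 164, 165)]
labels = {no: classify(yl, yr) for no, yl, yr in rows}
```

A test of this kind cannot separate real from virtual intersections when one line is nearly horizontal, which is the failure mode reported for intersections 3 and 25.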

7. Discussion and Conclusion 

The five-phase procedure described in this paper has been successful in identifying thirteen 
real intersection points in the object space from the two rectified images. At the same time 
eleven virtual intersections, arising from line pairs which do not intersect in object space, 
were rejected. The thirteen real pairs can be added to the initial set of correspondences to 
improve the rectification procedure. There were also two incorrect assignments (numbers 3 and 
25), where virtual intersections were assigned as real. In both of these cases one edge was 
nearly horizontal in the rectified images, and consequently no vertical disparity could be 
measured. Such cases do not interfere with the operation of the rectification procedure, but 
would result in spurious estimation of the apparent depth. 

Abstract. This paper presents a new cooperative algorithm based on the 
integration of stereo matching and segmentation. Stereo correspondence is 
recovered from two stereo images with the help of a segmentation result. Using a 
genetic algorithm (GA)-based image segmentation, we can refine the depth map 
more effectively. Experimental results are presented to illustrate the performance 
of the proposed method. 



1 Introduction 

Recovering 3-D information from a set of several images is an active research area in 
computer vision. The use of a pair of stereo images as a passive approach to range 
sensing has been extensively studied. Stereopsis is important for object tracking and 3D 
shape reconstruction in the traditional computer vision, and is becoming one of the key 
factors for image-based modeling and rendering. 

Although the stereo correspondence problem has been studied for a long time, the results 
are not very satisfactory [1]. To overcome difficulties such as noise and occlusion effects, 
several integrated methods have been proposed. Anthony attempted to unify stereo and 
motion cues [2]. Liu and Skerjanc introduced image pyramid in the stereo and motion 
matching process [7]. Some difficulties in binocular stereo can be eliminated by using 
three or more cameras to get the input images. Ruichek tried to use a sequence of stereo 
pairs [4]. Resolving ambiguities in stereo matching requires additional constraints (local 
supports) such as continuity constraints, disparity gradient constraints, and disparity 
similarity functions [3]. Most of the existing approaches assume the continuity or 
smoothness of visible surface, and impose a smoothness constraint upon the solution. 
This is usually true for a single surface, but is not true for an entire scene because a 
scene usually consists of different surfaces between which there is a discontinuity [3]. 

In this paper, under the assumption that a region with a similar depth usually has 
a similar color, we use a stricter constraint based on the region segmentation results. 
Given two stereo images, the solution for the correspondence problem is in the form of a 
disparity map. We initialize a disparity map in a traditional manner: a disparity map is 
obtained by calculating the sum of the squared difference between a pair of stereo 
images for a particular window size. Using a GA-based segmentation image, we can 
refine the depth map more precisely. A mean filter is used to locally average the 
disparity map in the same segmentation region. Comparisons between the traditional 
method and the proposed method are presented. 



2 GA-based Segmentation 

Segmentation is a process of identifying uniform regions based on certain conditions. 
Since segmentation is a highly complex combinatorial problem [5,6], many GA-based 
methods have been proposed. In [5], we proposed a GA-based image segmentation 
method, which can be implemented in a massively parallel manner and can produce 
accurate segmentation results. 

The segmentation is computed by chromosomes that evolve by a distributed genetic 
algorithm (DGA). Each chromosome represents a pixel and evolves independently. The 
chromosomes start with random solutions, and evolve iteratively through selection and 
genetic operations. In DGA, these operations are performed on locally distributed 
subgroups of chromosomes, called windows, rather than on the whole population. In 
selection, the chromosomes are updated to make new chromosomes by an elitist 
selection scheme. These operations are iteratively performed until the stopping criterion 
is satisfied. For the stopping criterion, the equilibrium is defined in [5]. The stopping 
criterion is reached when the equilibrium is above the equilibrium threshold or the 
number of generations is more than the maximal number. 




Fig. 1. The structure of GA-based segmentation [5] 

In Fig. 1, the structure of a chromosome is shown. A population is a set of 
chromosomes and represents a segmentation result. A chromosome consists of a label 
and a feature vector. For each chromosome k, its fitness is defined as the difference 
between the estimated color vector $y = \{y_1, \ldots, y_p\}$ and the actual color vector 
$x = \{x_1, \ldots, x_p\}$ at the location of the chromosome on the image. The fitness function is 

$$f(k) = -\sum_{i=1}^{p} |x_i - y_i| \qquad (1)$$

3 Correspondence Matching 

The proposed method is carried out in two steps. We initialize a disparity map in a 
traditional manner: a disparity map is obtained by calculating the sum of the squared 
difference between a pair of stereo images for a particular window size. Using the GA- 
based segmentation image, we can refine the depth map more effectively. In other 
words, the GA-based segmentation results are applied to refine the depth map at each 
pixel position. 



3.1 Initial correspondence matching 

A disparity map is obtained by calculating the sum of the squared difference (SSD) 
between a pair of stereo images for a particular window size, i.e., 

$$SSD(x,y) = \sum_{i=-w/2}^{w/2} \sum_{j=-w/2}^{w/2} \left[ I_L(x+i,\, y+j) - I_R(x+i+d,\, y+j) \right]^2 \qquad (2)$$

where SSD(x,y) represents the SSD of a pixel (x,y) in the left image, d is the window shift 
in the right image, w is the window size, and $I_L$ and $I_R$ are the gray-levels of the left and 
the right images, respectively. 

The value of d that minimizes the SSD is considered to be the disparity at each pixel 
position (x,y). The object distance to the camera, z, is related to the disparity by the equation: 

$$z = \frac{b f}{d} \qquad (3)$$

where b is the baseline or separation between the cameras, f is the focal length, and d is 
the horizontal disparity between two points. 
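
A compact sketch of the initial matching step follows. It is an illustration under stated assumptions rather than the authors' code: the set of candidate shifts is passed in by the caller, the shift is applied as x + d in the right image, image columns wrap around at the border, and the window sums are computed with an integral image for brevity. All function names are hypothetical.

```python
import numpy as np

def box_sum(a, w):
    """Sum of `a` over a w x w window centred on each pixel (w odd, zero padded)."""
    h = w // 2
    c = np.pad(a, h).cumsum(axis=0).cumsum(axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))          # integral image with a zero border
    return c[w:, w:] - c[:-w, w:] - c[w:, :-w] + c[:-w, :-w]

def ssd_disparity(left, right, shifts, w=5):
    """For every pixel, return the shift d (from `shifts`) with the smallest SSD."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    best = np.full(left.shape, np.inf)
    disparity = np.zeros(left.shape, dtype=int)
    for d in shifts:
        # shifted[y, x] == right[y, x + d]; columns wrap around at the border.
        shifted = np.roll(right, -d, axis=1)
        cost = box_sum((left - shifted) ** 2, w)
        better = cost < best
        best[better] = cost[better]
        disparity[better] = d
    return disparity

def depth(b, f, d):
    """z = b * f / d, for baseline b, focal length f and non-zero disparity d."""
    return b * f / d
```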



3.2 Refinement using a segmentation-based constraint 

An image segmentation result and a disparity map have been considered to be discrete; 
i.e., a collection of points. Let $d(i,j) = \{(i,j) : 1 \le i \le m,\ 1 \le j \le n\}$ denote the $m \times n$ lattice such 
that the elements in d index the disparity on the image pixels. Let the segmentation result 
be $s(i,j) = \{(i,j) : 1 \le i \le m,\ 1 \le j \le n\}$. 

We use a mean filter to locally average the disparity map. We take a $k \times l$ neighborhood 
about pixel (i,j). Let h be the refined disparity map, and n be the number of pixels which 
have the same segment label in the $k \times l$ neighborhood. 

$$h(i,j) = \frac{1}{n} \sum_{a=-k/2}^{k/2} \sum_{b=-l/2}^{l/2} d(i+a,\, j+b)\, u(i,\, j,\, i+a,\, j+b) \qquad (4)$$

$$u(i,j,k,l) = \begin{cases} 1, & \text{if } s(i,j) = s(k,l) \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

We apply Equation (4) iteratively until the stopping criterion is satisfied. We 
have to set the neighborhood size and stopping criterion. The values of these parameters 
are determined empirically. 
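
The refinement step can be sketched as a mean filter restricted to pixels that share the centre pixel's segment label. This is a plain, unoptimised illustration; the function name, the clipping of the neighbourhood at the image border and the fixed iteration count used as stopping criterion are assumptions.

```python
import numpy as np

def refine(disparity, labels, k=5, l=5, iterations=10):
    """Replace each disparity by the mean over the pixels in its k x l
    neighbourhood that carry the same segment label (Eqs. 4 and 5)."""
    d = np.asarray(disparity, dtype=float)
    rows, cols = d.shape
    for _ in range(iterations):
        out = np.empty_like(d)
        for i in range(rows):
            for j in range(cols):
                r0, r1 = max(i - k // 2, 0), min(i + k // 2 + 1, rows)
                c0, c1 = max(j - l // 2, 0), min(j + l // 2 + 1, cols)
                same = labels[r0:r1, c0:c1] == labels[i, j]   # u in Eq. (5)
                out[i, j] = d[r0:r1, c0:c1][same].mean()
        d = out
    return d
```

Because the centre pixel always matches its own label, the mean is always taken over at least one value, and disparities never leak across a segment boundary.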



4 Experimental Results 

In this section, we briefly describe our algorithm along with a running example. In order 
to verify the effectiveness of the method, experiments were performed on several test 
images. A set of commonly used stereo images was taken as the test images: The EPI 
Tree Set, Ball Set, etc. (http://cmp.felk.cvut.cz/~sara/Stereo/ New/Matching/smm.html). 







Fig. 2. Experimental result: (a) and (b) images are left and right images, respectively, (c) is the 
initial depth image generated by minimizing SSD, and (d) is the final depth image refined using 
segmentation results 

In the image segmentation, it should be noted that all the DGA parameters, such as 
the window size and probabilities of the genetic operations, have an influence on the 
performance of the algorithm. The equilibrium threshold was set to 100% and the 
maximal number of generations was set to 500. The parameters were as follows: the 
window size was 5x5; the probabilities of mutation and crossover were 0.01 and 0.005, 
respectively. The label size was 40. The DGA was simulated on a sequential machine. In the 
disparity refinement stage, the size of the kxl neighborhood strongly affects both 
time complexity and performance. Windows that are too small contain too little 
information, while large windows require much more time. In this experiment we set the 
neighborhood size to 5x5. The stopping criterion for the refinement was simply set to 
10 iterations. 





Fig. 3. Experimental result 



Table 1. Processing time taken to find a disparity map for an image (256x256) 

Stages                                                                    Time (sec.)
Region          A generation    Selection                                    0.022
Segmentation                    Crossover                                    0.002
                                Mutation                                     0.002
                Average time for an image                                    4.940
                (average # of generations x time for a generation)       (190 x 0.026)
Correspondence  Initial stereo correspondence                                1.870
Matching        Refinement stage                                             3.314
Total                                                                       10.124

Figures 2 and 3 show some results produced by the new algorithm on the two classical 
test images. The disparity maps obtained by the proposed method on these standard test 
images are compared with the traditional area-based approaches. In these figures, depth 
information is displayed as intensity images. The resulting disparity maps are more 
accurate on the borders of objects and in occluded regions. Table 1 shows the average time 
for segmentation of an image and for the correspondence matching process. 



5 Conclusions 

Experimental results in this research show that this algorithm is more robust than the 
traditional correspondence matching algorithm. The performance of the proposed 
method depends on two parts: the segmentation performance and the initial stereo 
correspondence. If the segmentation result is poor, the refined disparity map may 
be worse. In future work we will try to use a relaxation scheme to refine the disparity 
map, and to integrate the segmentation stage and the stereo matching stage more tightly 
into a single stage. In conclusion, although the proposed method requires more time to 
segment and refine the depth map than the traditional method, better stereo 
correspondence results can be obtained. 

Abstract. The effect of random news on the performance of adaptive agents as investors in 
the stock market is modelled by a genetic algorithm and measured by their portfolio values. The 
agents are defined by the rules evolved from a simple genetic algorithm, based on the rate of 
correct prediction on past data. The effects of random news are incorporated via a model of herd 
effect to characterize the human nature of the investors in changing their original plan of 
investment when the news contradicts their prediction. The random news is generated by white 
noise, with equal probability of being good and bad news. Several artificial time series with 
different memory factors in the time correlation function are used to measure the performance 
of the agents after the training and testing. A universal feature that greedy and confident 
investors outperform others emerges from this study. 



1. Introduction 

In the analysis of the stock market, computer programs modelling various 
investment strategies are often employed and statistical results on the yields of these 
strategies are used as a measurement for both the models and the decision process 
involved in the implementation of the strategies [1-4]. Usually the computer program 
is a given set of investment rules, extracted from historical data of the market, or rules 
based on fundamental analysis or news obtained from the inner circle of the trade. It 
is very difficult in real life to separate out the importance of technical analysis from 
other means of prediction, such as fundamental analysis with random news. 
Furthermore, the decision process of a trader with a given investment strategy may be 
very complex, as sudden news from fellow traders may alter the decision. 
Consequently, it is very difficult to combine trading strategies with the complex 
psychological processes taking place in the actual decision of the trader in a general 
model and achieve a reasonable understanding of the global behaviours of traders [3]. 
In order to initiate the investigation of these complex phenomena, we break the 
problem into several simpler problems and hope that the solution of each will shed 
some light on the general pattern of traders. We first modify the traditional economic 
models where homogeneous agents operating in isolation are used by two simple 
steps. The first step is to incorporate a certain level of communication among traders 
so that their decision processes are affected by their interactions. The second step is to 
relax the condition that the agents are homogeneous. There are of course many ways 
to introduce heterogeneous agents, but our focus is on the individuality of the agent, 
with personal character which entails different psychological responses to other 
agents. These two steps are by nature not easy to model, as human interactions and 
psychological responses are themselves very challenging topics. Nevertheless, we 
start by considering some techniques used by physicists. The first problem of 
incorporating interactions can be modelled by the standard technique of mean field 
theory, meaning that each trader will interact with the average trader of the market, 
who is representative of the general atmosphere of investment at the time. This will 
naturally incur some error as effects of fluctuation, or volatility of the stock market, 
may not be adequately treated. Next, we would like to consider the psychological response of 
individual traders. We can introduce simple quantitative parameters to measure 
individual characteristics of the trader, so that the response to the general atmosphere 
of the market is activated according to these parameters. Our model includes 
heterogeneous agents who are represented by different rules of investment as well as 
different human characters. The interactions between agents are simplified into 
interaction with a mean field, or the general atmosphere of the market. We model the 
general atmosphere of the market in an ad hoc manner, in that we do not deduce it 
from a model of microscopic interaction between agents, but rather by a source of 
random news which will serve as a kind of external, uncontrollable stimulus to the 
market. We understand that this simple model of stock market traders lacks the 
generality of a realistic agent. However, the important thing to note is that refinement 
can be introduced later to model better the interactions, while the heterogeneity of 
agents can be tuned by introducing more parameters in describing their individualities. 
We hope that this new approach of modelling the microscopic agents is a first step 
towards building a more comprehensive model of the stock market and its complex 
patterns. Our generic trader is an optimal rule in forecasting, using a genetic 
algorithms framework [5-10], where the relative performance of the individual agents 
(chromosomes) is compared in a finite population under the Darwinian principle of 
the survival of the fittest. In such a model, by suitably defining a measure of fitness on 
the level of individual agents and group of agents, self-organized behaviour in the 
population during evolution emerges. Furthermore, automatic control of the diversity 
of the population and increase in the average fitness of the entire population are 
observed [8-10]. A selection of fit rule and the subsequent augmentation of individual 
character of the agents will follow. Their performance is measured by the net asset 
value of their portfolios after a given period. 

2. Prediction as an Optimization Problem 

In using Genetic Algorithm for forecasting [5-10], the problem can be considered as 
pattern recognition and the subsequent optimisation of the rate of correct prediction. 
Since the objective of this work is to test the effects of news on the performance of the 
agents, we will employ a simple Genetic Algorithm for the forecasting of the time 
series, and focus our attention on the performance of portfolios of the agents. We 
perform training and testing for a given time series, x(t), with 2000 data points by first 
dividing it into three parts. The first 800 points form the training set for extracting 
rules. The next 100 points form the test set, used for evaluating the performance of the 
set of rules obtained after training. The last 1100 points form the news set for 
investigating the performance of investors with different degrees of greed and different 
levels of indifference to the random news. In the training set, we make the usual 
assumption that at time t, the value x(t) is a function of the values x(t-1), x(t-2), ..., 
x(t-k). Here k is set to 8. As to the rules of forecasting, we use the linear model of time 
series and assume a relation between the predicted value $\hat{x}(t)$ and its precedents: 

$$\hat{x}(t) = \sum_{i=1}^{k} \beta_i \, x(t-i)$$

The objective is to find a set of $\{\beta_i\}$ to minimize the root 
mean square error in $\hat{x}$ compared with the true value x. Here, we do not perform any 
vector quantization on the series {x(t)}, so that each x(t) is a real value. Note that 
these x(t) values can represent the daily rate of return of a chosen stock, with 
$|x(t)| < 1$. We also assume the same condition on $\beta_i$, so that $|\beta_i| < 1$, $i = 1, \ldots, k$. 
Since our objective is to look at performance of agents who buy and sell the particular 
stock, we only care for the sign of $\hat{x}$. What this means is that for $\hat{x}$ positive, the 
agent predicts an increase of the value of the stock and will act according to his 
specific strategy of trading. If $\hat{x}$ is non-positive, then he will predict either an 
unchanged stock price or a decrease, and will also act according to his specific 
strategy. We count the guess as a correct one if the sign of the guess value is the same 
as the actual value at that particular time, otherwise the guess is wrong. If the actual 
value is zero, it is not counted. The performance index $P_c$ that measures the fitness 
value of the chromosome is designed as the ratio $P_c = N_c / (N_c + N_w)$. Here $N_c$ is 
the number of correct guesses and $N_w$ is the number of wrong guesses. Note that in this 
simple genetic algorithm, we do not worry about the absolute difference between x 
and $\hat{x}$, only paying attention to their signs. This feature can be refined by a more 
detailed classification of the quality of the prediction. Furthermore, while most 
investors make hard decision on buy and sell, the amount of asset involved can be a 
soft decision. Indeed, when we introduce the greed parameter for the agent as 
discussed below, the decision on the amount of asset involved in the trading of a stock 
can be considerably softened. For the purpose of the present work, the prediction 
based on signs will be sufficient to obtain general insights of the coupling between 
agents and news, and the effect of this coupling on the global behaviour. Finally, we 
should remark that agents do not predict when $\hat{x}$ is zero, corresponding to the 
situation of holding onto the asset. We start with a set of 100 rules (chromosomes) 
represented by $\{\beta_i\}$. By comparing the performance of the chromosomes, the 
maximum fitness value is the result found by genetic algorithm using a modification 
of the Monte Carlo method that consists of selection, crossover and mutation 
operators [5-10]. We will leave the details of the genetic algorithms in a separate 
paper, here we just state the results. After several thousands of generations, we 
observe a saturated value of Pc, and we choose the chromosome corresponding to this 
Pc as the generic rule of prediction. We simply make use of the adaptive nature of the 
different chromosomes in a Darwinian world to single out the best performing 
chromosome to be the prototype of agents with different human characters. We 
introduce two parameters to characterize different human nature of the best agent. 
These two parameters are the greediness "g" and level of fear "f". The final set of 
agents, all with the same chromosome (or rule), but with different parameters of greed 
g and fear f, will be used for the performance evaluation in their portfolio 
management in response to online news. In the news set (last 1100 data points), we 
assign identical initial assets to the agents. For instance, we give each rule 
(chromosome, portfolio manager, or agent) an initial cash amount of 10,000 USD and 
100 shares. Values of f and g ranging from 0 to 0.96 in increments of 
0.04 are used to define a set of 25x25=625 different agents. We will then observe 
the net asset value of these 625 portfolios as they trade in the presence of news. 
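
The sign-based fitness of a rule can be sketched as follows. This is only an illustration: the function names are hypothetical, a forecast of exactly zero is treated as a non-increase (one possible reading of the text), and the genetic search over the coefficients is not shown.

```python
import numpy as np

def predict(x, beta):
    """One-step linear forecast x_hat(t) = sum_i beta[i-1] * x(t - i),
    for every t that has k = len(beta) known predecessors."""
    k = len(beta)
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(beta, x[t - k:t][::-1]) for t in range(k, len(x))])

def fitness(x, beta):
    """P_c = N_c / (N_c + N_w), comparing only the signs of forecast and
    actual value; points whose actual value is zero are not counted."""
    k = len(beta)
    actual = np.asarray(x, dtype=float)[k:]
    forecast = predict(x, beta)
    counted = actual != 0
    predicted_up = forecast[counted] > 0
    went_up = actual[counted] > 0
    n_correct = np.sum(predicted_up == went_up)
    return n_correct / counted.sum()

# The 25 x 25 grid of (f, g) pairs, 0 to 0.96 in steps of 0.04.
levels = np.arange(25) * 0.04
agents = [(f, g) for f in levels for g in levels]
```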

3. News Generation 

For a given past pattern, a particular agent will first make a comparison of this data 
with his rule and if the pattern matches his rule, then the prediction according to the 
rule is made. Without the news, the prediction is definite, for the agent is supposed to 
execute the action suggested by the rule. However, in the presence of the news, the 
agent will have to re-evaluate his action, reflecting the changed circumstances implied 
by the news. The present work treats 'news' as a randomly generated time series. This 
can of course be made more realistic by taking some kind of average of many real 
series of news as the input stimulus to the agent. One can also include a more detailed 
model of interaction of agents so that the input stimulus to the agent is neither an 
artificial time series, nor an external time series of news, but an internally generated 
series that reflect the dynamics of interacting agents. This will be studied in a future 
paper. Next, we consider the interaction of an individual agent with the randomly 
generated time series of news. The agent has to decide whether and how his/her action 
should be modified in view of the news. In making these judgements, the agent must 
anticipate certain probability of change, which reflects the 'greed' and the 'fear' of the 
agent in his decision process. For example, when there is news that is good in 
conventional wisdom, the stock market price is generally expected to increase. An 
agent, who had originally forecasted a drop in the stock price tomorrow and planned 
to sell the stock at today's price by taking profit, may change his plan after the arrival 
of the 'good' news, and halt his selling decision, or even convert selling into buying. 
This is a reversal of his original decision that is solely based on historical data 
analysis. Similarly, for an agent who originally wanted to buy the stock at today's 
price, as his forecast for tomorrow is a rise in stock price, may halt his buying action 
because 'bad' news just arrives. Instead of buying today, he may sell or hold on to his 
cash, for fear of a crash. This kind of reversal of buying action may in reality trigger 
panic selling. These behaviours anticipate an immediate effect of the news, and thereby often 
reverse the original decision based on rational analysis of historical data through pattern 
matching. To incorporate these realistic features of the markets, we introduce two 
additional real numbers to model the market. The following features now characterize 
each agent. (1) An integer indicating the class of prediction. For this paper we only 
use two classes, 1 for increases and 0 for decreases or unchanged stock price. (2) A 
rule to recognize pattern in the time series. (3) A real number f to characterize the 
level of fear of the agent in his original decision. If f is 0.9, then the agent has 90% 
chance of changing his decision when news arrives that contradicts his original 
decision. This denotes an insecure investor who easily changes his investment 
strategy. If f is 0.1, then there is only 10% chance of the agent of changing his original 
decision, a sign of a more confident investor. Thus, f is a measure of fear. (4) A real 
number g to characterize the percentage of asset allocation in following a decision to 
buy or sell. This number can be interpreted as a measure of the greediness of the 
agent. If g is 0.9, it means that the agent will invest 90% of his asset in trading, a sign 
of a greedy gambler. On the other hand, if g is 0.1, the agent only invests 10% of his 
asset in trading, a sign of a prudent investor. Thus, g is a measure of greed. 
Algorithmically, we first choose a random number c between 0 and 1 and decide that 
news is good if c > 0.5, otherwise it is bad. This model of random news may not be 
realistic, but will serve as a benchmark test on the effect of news on the agents by 
saying that there is equal chance of arrival of good news and bad news. There are four 
scenarios for the agent with news. (1) News is good and he plans to sell. (2) News is 
good and he plans to buy. (3) News is bad and he plans to sell. (4) News is bad and he 
plans to buy. Note that there is no contradiction between the agent's original plan and 
the news for case (2) and (3). But in case (1), the agent may want to reverse the selling 
action to buying action due to the good news, anticipating a rise in stock price in the 
future. Also, in case (4), the agent may decide to change his decision of buying to 
selling today, and buying stock in the future, as the news is bad and the stock price 
may fall in the future. Thus, in (1) and (4), the agent will re-evaluate his decision. He 
will first choose a random number p. If p > f, he will maintain his prediction, 
otherwise he reverses his prediction from 1 to 0 or from 0 to 1. Therefore f is a 
measure of fear of the agent. The parameter g measures the fraction of cash used when
buying stock, or the fraction of shares sold when selling stock. A large greed parameter g
characterizes a big gambler, who will invest a large fraction of his assets following the
rules and the news, while a small g characterizes a prudent investor.
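
The news response described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `respond_to_news` and `order_size` are hypothetical, and the sketch assumes that the greed parameter simply scales the cash (when buying) or the shares (when selling) committed to the trade.

```python
import random


def respond_to_news(prediction, fear, rng):
    """Return the agent's prediction after one random news item.

    prediction: 1 (expects a rise, plans to buy) or 0 (plans to sell).
    fear: the parameter f in [0, 1].
    """
    c = rng.random()
    news_is_good = c > 0.5                 # equal chance of good and bad news
    plans_to_buy = prediction == 1
    if news_is_good == plans_to_buy:       # cases (2) and (3): no contradiction
        return prediction
    p = rng.random()                       # cases (1) and (4): re-evaluate
    if p > fear:
        return prediction                  # the agent keeps his prediction
    return 1 - prediction                  # the agent reverses 1 -> 0 or 0 -> 1


def order_size(prediction, greed, cash, shares):
    """Assumed use of g: the fraction of cash or shares committed to the trade."""
    if prediction == 1:
        return "buy", greed * cash
    return "sell", greed * shares
```

With f = 0 the agent practically never reverses, while with f = 1 he reverses whenever the news contradicts his plan, which happens in about half of the draws under this news model.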




Fig. 1. Final values in cash of the portfolio of the 625 agents. Initially all agents have
the same value of 19900. The time series for the stock is Microsoft.



4. Results

We use several sets of time series, including the real stock value of Microsoft and
long- and short-memory time series with controlled auto-correlation generated using
the inverse whitening transformation. All these time series show similar behaviour. In
Fig. 1 we show the effects of news on the performance of the agents in terms of the
steady-state or 'final' values of their portfolios. Empirically, we find that the net asset
value of the portfolio (cash plus stock) reaches a steady value after more than 1000
responses to random news. From our numerical experiment with a ten-day probation
period (we then have 110 evaluations on the news set), we observe, to our surprise,
similar patterns in all the data sets. In Fig. 1, we observe a trend for the portfolio's
net asset value in cash to rise at large g and small f. This is an interesting universal
behaviour that demands an explanation, which we leave to a future paper. In a pool
of agents that are trained on historical data and endowed with individual characters
like greed and fear in their investment exercises, the effects of randomly generated
news show universal behaviour. This universal behaviour suggests that greedy
(large g) and confident (small f) investors perform better.

Abstract. In this study we used a new agent-based approach, an artificial
market approach, to analyze the ways that dealers process the
information in financial news. We compared the simulation results
obtained with virtual dealers in our model against interview data from actual
dealers. The results showed that there were similarities between the
dynamics of market opinions in the artificial and actual markets.



1 Introduction 

Large economic changes have recently brought to our attention the behavioral
aspects of economic phenomena. Many facts show that dealers’ forecasts interact
when they interpret the financial and political news related
to exchange rate dynamics. Investigators examining the interaction of forecasts
would find it very useful to know how dealers interpret, classify, and use news
when they forecast rates. Although some scholars have conducted fieldwork with 
dealers or used experimental markets with human subjects in order to investigate 
the way they think, it is difficult to know the thoughts of all dealers in real time. 

We have therefore developed a new agent-based approach to investigating the
interaction of dealers’ forecasts: an artificial market approach. Artificial markets
are virtual markets operating on computers. In artificial markets we can directly 
examine the thought processes of virtual dealers in real time and can repeatedly 
carry out many experiments under all market conditions. 



2 Framework of the Artificial Market Approach 

The artificial market approach consists of 3 steps. (1) Fieldwork: We gathered
field data by interviewing an actual dealer and extracted some features for model
construction. (2) Construction of an artificial market: We implemented a multi-agent
model that consists of computer programs as virtual dealers, a market-clearing
mechanism, and rate determination rules. (3) Comparison with real-world
data: We compared the simulation results obtained with the virtual dealers
against interview data from actual dealers.

3 Fieldwork

We interviewed a chief dealer who engaged in exchange-rate transactions in the 
Tokyo foreign exchange market. The interviewee was asked to explain the rate 
dynamics of the two years from Jan. 1994 to Nov. 1995. We asked the dealer 
to do the following: (a) divide these two years into several periods according to 
his recognition of market situations, (b) talk about which factors he regarded 
as important in his rate forecasts for each period, (c) rank the factors in or- 
der of weight (importance) and explain the reason for the ranking, and (d) in 
case his forecast factors changed between periods, describe the reasons for the 
reconsideration. 

Results: We identified the following features of the way the interviewee
changed prediction methods (the weights of factors):

When the interviewee changed the prediction method, he communicated with
other dealers in order to get information on which factors were regarded as important,
and then modified the prediction method so that it better explained recent
exchange-rate dynamics. Such communication and imitation behavior is similar
to genetic operations in biology. When a prediction method of a dealer is
regarded as an individual in a biological framework, the communication and
imitation behavior of dealers corresponds to “selection” and “crossover”.

When the forecast of the interviewee was quite different from the actual 
rate, he recognized the need to change his weights. This difference between the 
forecast and the actual rate can be considered to correspond to “fitness” in a 
biological framework. 

Given the similarities between these features of interaction of dealers’ forecast 
and genetic operations, we used a Genetic Algorithm (GA) to describe agent 
learning in our artificial market model. 

4 Construction of an Artificial Market Model 

Our artificial market model (AGEDASI TOF^1) has 100 agents. Each agent is a
virtual dealer and has dollar and yen assets. The agent changes positions in the
currencies for the purpose of making profits.

Each week of the model consists of 5 steps^2: (1) Each dealer receives 17 data
items of economic and political news^3 (Perception step), (2) predicts the future
rate using the weighted average of news data with her own weights (Prediction
step), and (3) determines her trading strategy (to buy or sell dollars) in order to
maximize her utility function (Strategy Making step) every week. Then, (4) the
equilibrium rate is determined from the supply and demand in the market (Rate
Determination step). Finally, (5) each agent improves her weights by copying
from the other successful agents using GA operators [1] (Adaptation step).

^1 A GEnetic-algorithmic Double Auction Simulation in TOkyo Foreign exchange market.
AGEDASI TOF is the name of a Japanese dish, fried tofu. It’s very delicious.

^2 The details of our model are described in preceding papers [2,3,5].

^3 The items are 1. Economic activities, 2. Price, 3. Interest rates, 4. Money supply,
5. Trade, 6. Employment, 7. Consumption, 8. Intervention, 9. Announcement, 10.
Mark, 11. Oil, 12. Politics, 13. Stock, 14. Bond, 15. Short-term Trend 1 (Change
in the last week), 16. Short-term Trend 2 (Change of short-term Trend 1), and 17.
Long-term Trend (Change through five weeks).

After the Adaptation Step, our model proceeds to the next week’s Perception
Step.
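
As a rough illustration of the Prediction and Adaptation steps, the following Python sketch evolves the agents’ news weights with a simple genetic algorithm. It is a minimal sketch under stated assumptions, not the AGEDASI TOF implementation: the fitness formula, the one-point crossover, the mutation rate, and the hypothetical `target` weights that stand in for the actual rate are all illustrative choices, and the Strategy Making and Rate Determination steps are omitted.

```python
import random

N_ITEMS = 17     # news items per week, as in the text
N_AGENTS = 100   # number of virtual dealers, as in the text


def predict(weights, news):
    """Prediction step: forecast as a weighted average of the news data."""
    return sum(w * x for w, x in zip(weights, news)) / len(news)


def adapt(population, errors, rng, mutation_rate=0.01):
    """Adaptation step (assumed form): selection, crossover, mutation."""
    # Smaller forecast error means higher fitness (illustrative formula).
    fitness = [1.0 / (1.0 + e) for e in errors]
    parents = rng.choices(population, weights=fitness, k=len(population))
    children = []
    for a, b in zip(parents[0::2], parents[1::2]):
        cut = rng.randrange(1, len(a))          # one-point crossover
        children.append(a[:cut] + b[cut:])
        children.append(b[:cut] + a[cut:])
    for child in children:
        for i in range(len(child)):
            if rng.random() < mutation_rate:    # occasional random mutation
                child[i] = rng.uniform(-1.0, 1.0)
    return children


def run_training(weeks, seed=0):
    """Run the training loop on synthetic news (hypothetical data)."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(N_ITEMS)]
                  for _ in range(N_AGENTS)]
    # Hypothetical weights that generate the "actual" rate in this sketch.
    target = [rng.uniform(-1.0, 1.0) for _ in range(N_ITEMS)]
    for _ in range(weeks):
        news = [rng.uniform(-1.0, 1.0) for _ in range(N_ITEMS)]
        actual = predict(target, news)
        errors = [abs(predict(w, news) - actual) for w in population]
        population = adapt(population, errors, rng)
    return population
```

Calling `run_training(20)` returns a population of 100 weight vectors of length 17. The sketch scores each agent on a single week’s forecast error rather than on an accumulated value.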

5 Comparison with Real-World Data 

The simulation results are compared with real-world data in order to test the 
validity of our model. First, we conducted extrapolation simulations of the rate 
dynamics from Jan. 1994 to Dec. 1995. Second, the 17 data items are classified 
into 3 categories based on the simulation results. Finally, using this classification, 
dynamics of agents’ prediction methods are compared with interview data. 



5.1 Simulation Methods 

We repeated the following procedure a hundred times in order to generate a hundred
simulation paths. First, the initial population is a hundred agents whose
weights are randomly generated. Second, we trained our model by using the
17 real-world data streams from Jan. 1992 to Dec. 1993^4. During this training
period, we skipped the Rate Determination Step and used the cumulated value
of differences between the forecast mean and the actual rate as the fitness in
the Adaptation Step. Finally, for the period from Jan. 1994 to Dec. 1995 we
conducted the extrapolation simulations. In this forecast period, our model forecasted
the rates in the Rate Determination Step by using only external data. We
did not use any actual rate data, and both the internal data and the fitness were
calculated on the basis of the rates generated by our model. We randomly selected
20% of the simulation paths and analyzed them. In the following sections,
we illustrate the results of the analysis considering one typical path. However,
the patterns of these results are common among the selected paths.



5.2 Classification of Data Weights 

The 17 data items were, as a result of the factor analysis of dynamic patterns of
their weights^5, classified into six factors. The matrix that is analyzed by factor
analysis is a list of 12 weights of 100 agents every 10 weeks during the forecast
period. Because this matrix includes the weight values in different weeks, it can
represent the temporal change of weights.

The weights of Economic activities and Price data have the largest loading
value of the first factor. We call the first factor the Price monetary factor,
because these two data are used by the price monetary approach to econometrics.
The second factor is related to Trade and Interest rate data, which are included
in the portfolio balance approach in econometrics, so we call it the Portfolio
balance factor. The third factor is related to Announcement and Employment
data, so we call it the Announcement factor. The fourth factor is related to
Intervention and Politics data, so we call it the Politics factor. The fifth factor
is related to Short-term trends and Stock data, so we call it the Short-term
factor. The sixth factor is related to Long-term trend data, so we call it the
Long-term factor.

^4 Each weekly time series was used a hundred times, so in this training period there
were about ten thousand generations.

^5 The proportion of explanation by these six factors is 67.0 %.

We combined these 6 factors into 3 categories: the Price monetary and Portfolio
balance factors are classified into an Econometrics category, the Announcement
and Politics factors are classified into a News category, and the Short-term
and Long-term factors are classified into a Trend category.

We compared this classification with real-world data gathered by questionnaires
[4]. The results showed that actual dealers classified news data in the same
way as the agents in the simulation.



5.3 Comparison with Interview Data 

We held interviews with two dealers who usually engaged in yen-dollar exchange
transactions in the Tokyo foreign exchange market. The first dealer (X) was a chief
dealer in a bank. The second dealer (Y) was an interbank dealer in the same
bank. They had more than two years of experience on the trading desk. The
interview methods are described in Section 3. Each interviewee (dealers X and
Y) ranked the factors in order of their weights (Table 1 (a) and (b)). We compared
temporal changes of the rank of factors in the interview data with the dynamics
of weights in the computer simulation.



Table 1. Results of interviews: The forecast factors are ranked in order of importance. 

[Figure 1 panels: (a) Econometrics category (x-axis: Price monetary factor, y-axis: Portfolio balance factor); (b) News category (x-axis: Announcement factor, y-axis: Politics factor); (c) Trend category (x-axis: Short-term trend factor, y-axis: Long-term trend factor)]

Fig. 1. Distribution of loading values of each category: Loading values of 100 agents
are plotted on the planes whose axes represent factors.



Econometrics category: The distribution patterns of agents’ scores of the
Price monetary factor and Portfolio balance factor are illustrated in Fig. 1a. The
distribution patterns were stable near the origin of the coordinate axes during
these 2 years, except that scores of the Portfolio balance factor shifted slightly
downward. Thus, the simulation results showed that the influence of the price monetary
factor was stable but not so large, and that the agents paid attention to the
portfolio balance factor in the first half of 1995.

In the interview data of dealer X, the weight of the trade balance factor
was large in the first half of 1995 (periods VI and VII in Table 1). This supports
the simulation results. The other econometric factors were not mentioned
in the interviews. Probably the dealers did not find it necessary to mention them
because their interpretation was common and fixed during these two years. If so,
this fact is also similar to the simulation results.

News category: The distribution patterns of agents’ scores of the Announcement
factor and Politics factor are illustrated in Fig. 1b. The distribution patterns
in 1994 and those in 1995 are clearly different. In 1994, the scores spread widely,
while in 1995, they shifted to the left and bottom areas. Thus, the simulation results
showed that almost all the agents focused on the news category in 1995.

Both dealers X and Y regarded the politics, intervention, and announcement
factors as important during the bubble (periods VI, VII, and VIII in
Table 1a and periods VI, V, and VII in Table 1b). These interview data support
the simulation results that market opinions about the news category converged
in 1995.

Trend category: The distribution patterns of agents’ scores of the Short-term
factor and Long-term factor are illustrated in Fig. 1c. At first the scores
were distributed in the minus area of the Short-term factor (March 1994). Then they
moved to the plus area (September and December 1995). Finally, they returned
to the center of the x axis, and shifted to the minus area of the Long-term factor
(December 1995). Plus values of the Short-term trend factor’s weights mean that the
agents tend to follow the recent chart trend. Minus values of the Long-term trend
factor’s weights mean that the agents forecast that rates return to the previous
level after a large deviation. Thus, the simulation results show trend-following
behavior of agents in 1995 and recursive expectation at the end of 1995.

Short-term trend factors were not explicitly mentioned in the interviews.
Both dealers, however, emphasized the importance of market sentiment
(bullish or bearish) during the bubble. The market sentiment can be considered
a representation of the short-term market trend. Hence, their stress on market
sentiment supports the simulation results that the trend factors magnified rate
fluctuation. Both dealers regarded the deviation or chart factor as important after
the large deviation in 1995. This fact agrees with the simulation results
about the Long-term trend factor.

6 Conclusion 

In this study, we took an artificial market approach and found a categorization
of factors that was similar to the actual dealers’ categorization. Using this
categorization, we have identified some emergent phenomena in markets such
as rate bubbles [2, 3]. The overall results of this study show that the artificial
market approach to modeling is effective for analyzing real-world markets.

Abstract. Using genetic programming, this paper proposes an agent-based
computational model of double auction (DA) markets, in the
sense that a DA market is modeled as an evolving market of autonomous
interacting traders (automated software agents). The specific DA market
on which our modeling is based is the Santa Fe DA market ([12], [13]),
which, in structure, is a discrete-time version of the Arizona continuous-time
experimental DA market ([14], [15]).



1 Introduction 

The purpose of this paper is to use genetic programming as a major tool to
evolve the traders in an agent-based model of a double auction (DA) market.
With this modeling approach, we attempt to provide an analysis of bargaining
strategies in DA markets from an evolutionary perspective. The
novelties of this paper, which help distinguish it from earlier studies, are
twofold. First of all, to our best knowledge, the existing research on bargaining
strategies in DA markets does not use agent-based models. This research is, therefore,
the first to do so. Secondly, while this research is not the first to study
bargaining strategies from an evolutionary perspective, it is the first to use
genetic programming on this issue. We believe that genetic programming, as a
methodological innovation to economics, may be powerful enough to enable us
to get new insights into the form of effective trading strategies, and help us better
understand the operation of the “invisible hand” in real-world markets. Furthermore,
since the ideas of “software agents” and “automated programs” should play
an increasingly important role in the era of electronic commerce, the agent-based
model studied in this research can be a potential contribution to electronic commerce
too. The rest of this section is written to justify the claimed novelties and
significance.

2 Bargaining Strategies in DA Markets: Early 
Development 

The double auction (DA) market has been the principal trading format for many
types of commodities and financial instruments in organized markets around the
world. The pit of the Chicago Commodities market is an example of a double
auction and the New York Stock Exchange is another. In a general context,
traders in these institutions face a sequence of non-trivial decision problems,
such as

- how much should they bid or ask for their own tokens?

- how soon should they place a bid or ask?

- under what circumstance should they accept an outstanding bid or ask of
some other trader?

Since [14], the experimental studies using human subjects have provided con- 
siderable empirical evidence on trading behavior of DA markets, which, to some 
extent, demonstrates that DA markets have remarkable efficiency properties. 
Nevertheless, these studies cast little light on trading strategies which are essen- 
tially unobservable. 

Modern economic theory has attempted to explain observed trading behavior
in DA markets as the rational equilibrium outcome of a well-defined game of
incomplete information. The “null hypothesis” is that observed trading behavior 
is a realization of a Bayesian-Nash equilibrium (BNE) of this game. However, 
due to the inherent complexity of continuous-time games of incomplete informa- 
tion, it is extremely difficult to compute or even characterize these equilibria. As 
a result, relatively little is known theoretically about the nature of equilibrium 
bargaining strategies. 

3 Computational Modeling of DA Markets: 
Zero-Intelligence “Theorem” 

Recently, the computational approach, as a complement to the analytical and the
experimental ones, has also been applied to the study of bargaining strategies in DA
markets. Two influential early contributions in this line of research appeared in
1993. One is [7], and the other is [12]. While both addressed the nature of the 
bargaining strategies within the context of DA markets, the motivations behind 
them are quite different. 

Motivated by a series of studies by Vernon Smith, [7] addressed the issue:
how much intelligence is required of an agent to achieve human-level trading 
performance? Using an electronic DA market with software agents rather than 
human subjects, they found that the imposition of the budget constraint (that 
prevents zero-intelligence traders from entering into loss-making deals) is suffi- 
cient to raise the allocative efficiency of the auctions to values near 100 percent. 
The surprising and significant conclusion made by them is, therefore, that the 
traders’ motivation, intelligence, or learning have little effect on the allocative 
efficiency, which derives instead largely from the structure of the DA markets. 
Thus, they claim 



Adam Smith’s invisible hand may be more powerful than some may have
thought; it can generate aggregate rationality not only from individual
rationality but also from individual irrationality. (Ibid., p. 119, Italics
added.)

Fig. 1. Four Types of Demand and Supply Curves (Adapted from [5].)

Furthermore, 

... the convergence of transaction price in ZI-C markets is a consequence
of the market discipline; traders’ attempts to maximize their profits, or
even their ability to remember or learn about events of the market, are
not necessary for such convergence. (Ibid., p. 131)

While it sounds appealing, Gode and Sunder’s strong argument on zero intelligence
(ZI) was demonstrated to be incorrect by [5]. Using an analysis of the
probability functions underlying DA markets populated by Gode and Sunder’s ZI
traders, [5] showed that the validity of the zero-intelligence “theorem” is largely a
matter of coincidence. Roughly speaking, only in a market whose supply and demand
curves are mirror-symmetric, by reflection in the line of constant price at
the equilibrium value P0, over the range of quantities from zero to Q0 (see Figure
1-(A) above), can the ZI traders trade at the theoretical equilibrium price. In
more general cases, such as those shown in Figure 1-(B), (C) and (D), ZI traders can
easily fail. The failing of the ZI traders indicates a need for bargaining mechanisms
more complex than the simple stochastic generation of bid and offer prices.
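
To make the idea of budget-constrained zero-intelligence (ZI-C) traders concrete, the following Python sketch simulates one trading period. It is a simplified illustration under stated assumptions, not the procedure of [7] or [5]: each trader holds a single unit, the order book keeps only the best bid and ask, a crossing order trades at the standing quote, and the names `zi_c_session` and `allocative_efficiency` are hypothetical.

```python
import random


def zi_c_session(values, costs, max_price, steps, rng):
    """Simulate one period; return trades as (price, buyer_value, seller_cost)."""
    buyers = list(values)     # redemption values of buyers still in the market
    sellers = list(costs)     # unit costs of sellers still in the market
    best_bid = None           # standing bid as (price, value)
    best_ask = None           # standing ask as (price, cost)
    trades = []
    for _ in range(steps):
        if not buyers or not sellers:
            break
        if rng.random() < 0.5:
            value = rng.choice(buyers)
            bid = rng.uniform(0.0, value)          # never bid above own value
            if best_ask is not None and bid >= best_ask[0]:
                price, cost = best_ask             # trade at the standing ask
                trades.append((price, value, cost))
                buyers.remove(value)
                sellers.remove(cost)
                best_bid = best_ask = None         # clear the book after a trade
            elif best_bid is None or bid > best_bid[0]:
                best_bid = (bid, value)
        else:
            cost = rng.choice(sellers)
            ask = rng.uniform(cost, max_price)     # never ask below own cost
            if best_bid is not None and ask <= best_bid[0]:
                price, value = best_bid            # trade at the standing bid
                trades.append((price, value, cost))
                buyers.remove(value)
                sellers.remove(cost)
                best_bid = best_ask = None
            elif best_ask is None or ask < best_ask[0]:
                best_ask = (ask, cost)
    return trades


def allocative_efficiency(trades, values, costs):
    """Realized surplus divided by the maximum attainable surplus."""
    realized = sum(v - c for _, v, c in trades)
    pairs = zip(sorted(values, reverse=True), sorted(costs))
    best = sum(v - c for v, c in pairs if v > c)
    return realized / best
```

Because bids never exceed values and asks never fall below costs, every trade in this sketch satisfies cost <= price <= value, which is the budget constraint discussed above. Running it on asymmetric supply and demand schedules is one way to explore the cases in which ZI traders fail.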

While this line of research can be further pursued, one should notice that
what actually concerns traders is their own profits from trade. There is no
reason why they should behave like ZI traders simply because ZI traders might
collectively generate allocative efficiency. On the contrary, they may behave “too
smart for their own interests”. Consequently, models with ZI or ZI-Plus traders
are unlikely to provide a good basis for understanding human trading
strategies.

4 Computational Modeling of DA Markets: SFI DA
Tournaments

Leaving collective rationality aside, [12]’s computational study of bargaining
strategies was largely motivated by individual rationality. Instead of asking about the
minimal intelligence required for collective rationality, they asked: is it the case
that sophisticated strategies make individual traders better off? Their analysis
was based on the results of computerized double auction tournaments held at the
Santa Fe Institute beginning in March 1990. Thirty programs were submitted to
these tournaments. They were written by programmers with different
background knowledge (economics, computer science, cognitive science,
mathematics, ...), and hence are quite heterogeneous in various dimensions (modeling
strategies, complexity, adaptability, ...). For example, in complexity, they
ranged from simple rules of thumb to sophisticated adaptive/learning procedures
employing some of the latest ideas from the literature on artificial intelligence
and cognitive science.

After an extensive series of computer tournaments involving hundreds
of thousands of individual DA games, the results may come as a surprise:
nearly all of the top-ranked programs were based on a fixed set of intuitive
rules of thumb. For example, the winning program, known as Kaplan’s strategy,
makes no use of the prior information about the joint distribution of token values,
and relies on only a few key variables such as its privately assigned token
values, the current bid and ask, its number of remaining tokens, and the time
remaining in the current period. Quite similar to the classical result presented by
[2] in the context of the iterated prisoner’s dilemma, i.e., to be good a strategy must
not be too clever, [12] reconfirmed this simplicity principle. In [12], the effective
bargaining strategies are simple in all aspects, and can be characterized
as non-adaptive, non-predictive, non-stochastic, and non-optimizing.

Therefore, while Rust et al.’s auction markets were composed of traders with
heterogeneous strategies, their result on the simplicity of the effective bargaining
strategies is, in spirit, very similar to what Gode and Sunder found in
markets with homogeneous traders. Moreover, the general conclusion that the
structure of a double auction market is largely responsible for achieving a high level
of allocative efficiency, regardless of the intelligence, motivation, or learning of
the agents in the market, is well accepted in both lines of study. However, for
reasons that we shall argue below, this conclusion, together with the simplicity
criterion, is in doubt. For convenience, we shall call this doubtful argument the
intelligence-independent property, which should roughly capture the essence of
zero intelligence in [7] and “rules of thumb” in [12].

5 What is Missing? Evolution 

First of all, the intelligence-independent property is clearly not true in the context of
imitation dynamics. For an illustration, consider Kaplan’s strategy. The Kaplan
strategy waits in the background until the other participants have almost negotiated
a trade (the bid/ask spread is small), and then jumps in and steals the deal
if it is profitable. Suppose that we allow imitation among traders; then we would
expect growth in the relative numbers of these sorts of background traders. Less
profitable traders should gradually exit the market due to competitive pressure.
In the end, all traders in the market are background traders. However, the background
traders create a negative “information externality” by waiting for their
opponents to make the first move. If all traders do this, little information will be
generated and the market will be unable to function efficiently. As a result, the
“wait in the background” strategy would eventually become unprofitable, and hence
could no longer be effective. Consequently, Rust et al.’s characterization of
effective strategies may not be able to hold in an evolutionary context.

Toward an Agent-Based Computational Modeling of Bargaining Strategies 521

Fig. 2. Evolving Complexity of Traders’ Forecasting Models (Adapted from Figure 9
in [4]).
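
The “wait in the background” behaviour attributed to Kaplan’s strategy can be sketched as a simple decision rule for a buyer. This is only a hedged illustration, not the tournament program: the function name `kaplan_like_bid`, the relative spread threshold `max_spread`, and the end-of-period rule governed by `deadline` are assumptions introduced here.

```python
def kaplan_like_bid(token_value, current_bid, current_ask, time_left,
                    max_spread=0.1, deadline=2):
    """Return the price to accept, or None to keep waiting in the background.

    max_spread and deadline are hypothetical thresholds for this sketch.
    """
    if current_ask is None:
        return None                       # nothing to steal yet
    if current_ask >= token_value:
        return None                       # the deal would not be profitable
    if time_left <= deadline:
        return current_ask                # assumed rule: do not let time run out
    if current_bid is not None:
        spread = current_ask - current_bid
        if spread <= max_spread * current_ask:
            return current_ask            # others have almost agreed: jump in
    return None                           # otherwise stay in the background
```

For example, `kaplan_like_bid(100, 80, 85, 50)` accepts the ask of 85 because the spread is small, whereas `kaplan_like_bid(100, 50, 85, 50)` keeps waiting. If every trader followed such a rule, no one would post the quotes that narrow the spread, which is the information externality described above.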

In fact, the simplicity principle argued by [2] was recently shown to be incorrect
by [3]. By using a larger class of strategies, they showed that the simple Tit for
Tat strategy was beaten by a more complex strategy called gradual in almost
all their experiments. In conclusion, they emphasized the significance of evolution
(adaptation).

Evaluation can, however, not be based only on the results of complete 
classes evolution, since a strategy could have a behavior well adapted to 
this kind of environment, and not well adapted to a completely different 
environment. (Ibid, p. 40) 

The significance of evolution for the complexity of strategies was also shown
in [4]. In their agent-based modeling of artificial stock markets, they conducted
an analysis of the evolving complexity of each trader’s forecasting models, and
a typical result is shown in Figure 2. Their results show that traders
can evolve toward a higher degree of sophistication, while at some points in time
they can be simple as well. Therefore, it is very difficult to make much sense of
the simplicity principle from a steady environment.

6 Evolving Bargaining Strategies 

In the literature, there are two studies which actually attempted to give an artificial
life to bargaining strategies. One is [1], and the other is [11]. Both relied on
genetic programming. Nevertheless, neither of them can be considered a truly
evolutionary model of DA markets. To see this, the market architectures of these
two studies are drawn in Figures 3 and 4.

What Andrews and Prager did was to fix a trader (Seller 1 in their case) and
use genetic programming to evolve the trading strategies of only that trader.
In the meantime, one opponent was assigned the trading strategy “Skeleton”,
a strategy prepared by the SFI tournament. The trading strategies of the other
six opponents were randomly chosen from a selection of successful Santa Fe
competitors. Therefore, what Andrews and Prager did was to see whether GP
can help an individual trader to evolve very competitive strategies given the
opponents’ strategies. However, since the opponents are not equipped with
the same opportunity to adapt, this is not a truly evolutionary model of DA
markets.

On the other hand, [11]’s architecture can be motivated as follows. Suppose
that you are an economist, and you would like to select a pair of bargaining
strategies, one for all sellers, and one for all buyers. Then you are asking how to
select such a pair of rules so that the allocative efficiency can be maximized (Olsson
chose the Alpha’s value as the fitness function). To solve this problem, Olsson
also used genetic programming. In this application, traders are not pursuing
their own interests, but trying to please the economist. Moreover, they all
share the same strategy at any moment in time. Hence, Olsson’s model,
very much like the model of artificial ants, is certainly not an evolutionary model of
DA markets.

In sum, while both [1] and [11] did use genetic programming to “grow” bargaining
strategies, the style in which they used GP did not define an evolutionary
model of DA markets.

7 Agent-Based Modeling of DA Markets: Trading 
Behavior 

[6] may be considered the first agent-based computational model of DA markets.
According to the web pages on agent-based computational economics:
http://www.econ.iastate.edu/tesfatsi/ace.htm, 

“Agent-based computational economics (ACE) is roughly defined by its practi- 
tioners as the computational study of economies modeled as evolving systems 
of autonomous interacting agents.... ACE is thus a blend of concepts and tools 
from evolutionary economics, cognitive science, and computer science.” 
Fig. 3. The DA Market Architecture of [1] (panel title: Andrews and Prager (1994): Market Architecture)



The market architecture of [6] is depicted in Figures 5 and 6. Dawid considered
two populations of agents: 100 buyers and 100 sellers. Each seller has the
potential to produce one unit of the commodity every period. The production cost is
given by $c \in [0, 1]$ ($c = 0, 0.1$ in his experiments). The seller produces the good
only if he can sell it in the same period. Each buyer gains utility $u$, with
$1 \ge u > c$, from consuming the good ($u = 1, 0.7$ in his experiments); $u$ and $c$
are private information.
During each period, every seller is randomly matched with a buyer and both
submit a sealed bid. The buyer submits the price he is willing to pay ($p_b$), and
the seller gives the minimum payment for which he will deliver the good ($p_s$).
Buyers and sellers know that $c$ and $u$ lie in $[0, 1]$ and accordingly restrict their
bids to this interval. If $p_b > p_s$, one unit of the good is traded at a price of

$$p_{trade} = \frac{p_b + p_s}{2} \qquad (1)$$

Otherwise, no trade takes place. He then applied the so-called single-population
genetic algorithm to buyers and sellers simultaneously. However, constrained by the
GA, what one can observe from Dawid's model is only the evolution of bids
and asks rather than of the bargaining strategies by which the bids and asks are
generated. Therefore, while Dawid's model is the first application of agent-based
modeling to DA markets, it is not really a model suitable for the study of bargaining
strategies.
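The matching and trading rule of one period can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the random bids and the seed are hypothetical, and settling a trade at the midpoint of the two bids is an assumption of this sketch rather than something the text above spells out.

```python
import random


def trade_price(p_b, p_s):
    """Return the trade price for one matched pair, or None if no trade occurs.

    Assumption: a successful trade is settled at the midpoint of the two bids.
    """
    if p_b > p_s:
        return (p_b + p_s) / 2
    return None


def run_period(buyer_bids, seller_asks, rng):
    """Randomly match every seller with exactly one buyer and settle each pair."""
    buyers = list(range(len(buyer_bids)))
    rng.shuffle(buyers)
    return [trade_price(buyer_bids[b], seller_asks[s]) for s, b in enumerate(buyers)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
bids = [rng.random() for _ in range(100)]  # 100 buyers, bids restricted to [0, 1]
asks = [rng.random() for _ in range(100)]  # 100 sellers, asks restricted to [0, 1]
prices = run_period(bids, asks, rng)
traded = [p for p in prices if p is not None]
print(f"{len(traded)} of {len(prices)} matched pairs traded")
```

Note that the objects being evolved here would be the numbers in `bids` and `asks` themselves, which is exactly why such a model shows the evolution of bids and asks but not of the strategies that generate them.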

8 Agent-Based Modeling of DA Markets: Trading 
Strategies 

Given this development of the literature, the next step in the computational modeling
of DA markets seems clear, and the architecture proposed in this research
is outlined in Figure 7.
Fig. 4. The DA Market Architecture of [11] (panel title: Olsson (1999): Market Architecture)



This simple architecture shows some distinguishing features of this research.
The first is the use of genetic programming. But we do not simply come to say:
"Hey! This is genetic programming. Try it! It works." To our understanding,
genetic programming can be considered a novel micro-foundation for economics.
In fact, its relevance to the study of adaptive behavior in economics can be
inferred from [9]. First, Lucas gave a notion of an agent in economics.

In general terms, we view or model an individual as a collection of decision
rules (rules that dictate the action to be taken in given situations)
and a set of preferences used to evaluate the outcomes arising from particular
situation-action combinations. (Ibid., p. 217. Italics added.)

Second, he proceeded to describe the adaptation of the agent. 

These decision rules are continuously under review and revision; new
decision rules are tried and tested against experience, and rules that produce
desirable outcomes supplant those that do not. (Ibid., p. 217. Italics
added.)

Let us read these two quotations within the context of DA markets. An
individual would be treated as a trader, and a decision rule is just a bargaining
strategy. To be specific, we consider the three strategies studied by [12] and [13],
namely, the Skeleton strategy, the Ringuette strategy, and the Kaplan strategy.
Fig. 5. The DA Market Architecture of [6] (panel title: Dawid (1999): Market Architecture)



The flowcharts of these three strategies, adapted from [13], are displayed in
Figures 8, 9, and 10.

In addition to the flowchart representation, these three strategies can also be
represented in what is known as the parse-tree form, shown in Figures 11,
12, and 13. In this case, what [9] meant by a collection of decision rules (bargaining
strategies) can be concretely represented as a collection of parse trees. The
second quotation from Lucas is then about the review of these bargaining strategies
(parse trees), and from this review, new bargaining strategies (parse trees) may
be generated. Notice that here Lucas was not talking about just a single decision
rule but about a collection of decision rules. In other words, he was talking about the
evolution of a population of decision rules.

Now, based on what we have just described, if each decision rule can be
written and implemented as a computer program, and since every computer program
can be represented as a LISP parse-tree expression, then the Lucasian Adaptive
Economic Agent can be modeled equivalently as either of the following:

- evolving population of computer programs, 

- evolving population of parse trees. 

Whatever we may call this modeling procedure, this is exactly what
genetic programming does, and in fact, no other technique known to
us can accomplish this task as effectively as GP. Hence, it
would not be too much of an exaggeration to claim genetic programming as a
methodological innovation in economics.
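To make the idea of a decision rule as a parse tree concrete, the following minimal Python sketch evaluates LISP-style trees written as nested tuples. The operators (`if_less`, `min`, `*`), the terminals (`best_ask`, `token_value`) and the rule itself are hypothetical illustrations; they are not the Skeleton, Ringuette or Kaplan strategies.

```python
def evaluate(tree, env):
    """Evaluate a parse tree given as nested tuples: (operator, arg1, arg2, ...)."""
    if isinstance(tree, (int, float)):
        return tree                      # numeric constant
    if isinstance(tree, str):
        return env[tree]                 # terminal: look up a market variable
    op, *args = tree
    if op == "if_less":                  # (if_less a b then_branch else_branch)
        a, b, then_branch, else_branch = args
        branch = then_branch if evaluate(a, env) < evaluate(b, env) else else_branch
        return evaluate(branch, env)
    values = [evaluate(arg, env) for arg in args]
    if op == "min":
        return min(values)
    if op == "*":
        return values[0] * values[1]
    raise ValueError(f"unknown operator: {op}")


# Hypothetical buyer rule, written LISP-style:
# (if_less best_ask token_value (min best_ask token_value) (* 0.9 token_value))
toy_rule = ("if_less", "best_ask", "token_value",
            ("min", "best_ask", "token_value"),
            ("*", 0.9, "token_value"))

# A trader's "collection of decision rules" is then simply a list of such trees.
population = [toy_rule, ("*", 0.5, "token_value")]
bids = [evaluate(rule, {"best_ask": 0.5, "token_value": 0.8}) for rule in population]
```

Because each rule is plain data, reviewing and revising the collection amounts to scoring the trees and recombining their subtrees, which is the operation genetic programming performs.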
Fig. 6. The DA Market of [6]: Demand and Supply (panel title: The Double Auction Market: Demand and Supply Curve)



The second distinguishing feature is not just the use of genetic programming,
but the use of population genetic programming. The weakness of using simple GP in
agent-based modeling has already been pointed out in [4]. Again, there is no
reason to assume that traders will release their bargaining strategies
for others to imitate. Therefore, to avoid misusing GP in the agent-based computer
simulation of DA markets, it is important to use population GP.
Fig. 8. The Flow Chart of the Skeleton Strategy

Fig. 9. The Flow Chart of the Ringuette Strategy

Fig. 10. The Flow Chart of the Kaplan Strategy

Fig. 11. The Skeleton Strategy in Parse-Tree Representation

Fig. 13. The Kaplan Bargaining Strategy in Parse-Tree Representation
Abstract. Ordinal data play an important part in financial forecasting. For example,
advice from expert sources may take the form of "bullish", "bearish" or
"sluggish", or "buy" or "do not buy". This paper describes an application of
Genetic Programming (GP) to combining investment opinions. The aim is to
combine ordinal forecasts from different opinion sources in order to make better
predictions. We tested our implementation, FGP (Financial Genetic Programming),
on two data sets. In both cases, FGP generated more accurate rules than
the individual input rules.



1 Introduction 

Ordinal data could be useful in financial forecasting, as Fan et al. [6] quite rightly
pointed out. For example, forecasts by experts may predict that a market is "bullish",
"bearish" or "sluggish". A company's books may show "deficit" or "surplus". A
share's price today may have "risen", "fallen" or "remained unchanged" from
yesterday's. The question is how to make use of such data.

Let $Y$ be a series, gathered at regular intervals of time (such as daily stock market
closing data or weekly closing prices). Let $Y_t$ denote the value of $Y$ at time $t$.
Forecasting at time $t$ with a horizon $h$ means predicting the value of $Y_{t+h}$ based
on some information set $I_t$ of other explanatory variables available at time $t$. The
conditional mean

$$F_{t,h} = E[Y_{t+h} \mid I_t]$$

represents the best forecast of the most likely $Y_{t+h}$ value [8]. In terms of the
properties of the value $Y_t$, forecasts can be classified into point forecasts, where
$Y_t$ is a real value, and ordinal forecasts, where $Y_t$ is an interval estimate. In terms
of the properties of $I_t$, forecasts can be classified into time-series forecasts, where
$I_t$ consists of nothing but $Y_{t-i}$ with $i \ge 0$, and combining forecasts, where $I_t$
only includes a finite number of direct forecast results from different sources.

In recent years, there has been growing interest in combining forecasts; for example,
see [17, 13] for combining point forecasts and [6, 3] for combining ordinal forecasts.
The methodologies adopted in these studies are mainly statistical methods
and operations research methods. The full potential of AI forecasting techniques such
as genetic algorithms [9] has yet to be realized.

K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 532-537, 2000. 

© Springer-Verlag Berlin Heidelberg 2000 
In this paper, we follow the study of Fan and his colleagues and focus on combining
ordinal forecasts. We demonstrate the potential of Genetic Programming (GP)
[11] in combining and improving individual predictions in two different data sets:

(i) a small data set involving the Hong Kong Hang Seng Index as reported by
Fan and his colleagues [6]; and

(ii) a larger data set involving the S&P 500 index from 2 April 1963 to 25 January
1974 (2,700 trading days).



2 FGP for Combining Ordinal Forecasts 

2.1 Background: Genetic Programming and Its Application to Finance 

Genetic algorithms (GAs) are a class of optimization techniques inspired by the
principle of natural selection in evolution. Genetic Programming is a promising variant
of genetic algorithms that evolves tree representations instead of strings. The basic
algorithm is as follows. Candidate solutions are referred to as chromosomes, and the
program maintains a set of chromosomes, which is referred to as a population. Each
chromosome is evaluated for its fitness according to the function that is to be optimized.
Fitter chromosomes are given more chance of being picked to become parents, which are
used to generate offspring. Offspring copy their material from both parents using
various mechanisms under the name of crossover. Offspring are sometimes given a
chance to make minor random changes, which are referred to as mutations. Offspring
may replace existing members of the population. The hope (supported by theoretical
analysis, see for example [7]) is that after a sufficient number of iterations, better
candidate solutions will be generated. GP has been successful in many applications,
including financial applications; see, e.g., [1, 12, 14, 4].

FGP (Financial Genetic Programming) is a genetic programming implementation 
specialized for financial forecasting. It is built as a forecasting tool under the EDDIE 
project [16]. In this paper, we shall focus on its application in combining individual 
expert predictions in order to generate better predictions. 



2.2 Candidate Solutions Representation 

In the Hong Kong stock market example in the next section, the set of possible
categories is {bullish, bearish, sluggish, uncertain}. In the S&P 500 index example in the
subsequent section, the set of categories is {buy, not-buy}.

FGP searches in the space of decision trees whose nodes are functions, variables,
and constants. Variables and constants take no arguments and they form the leaf
nodes of the decision trees. In the applications described in this paper, both the
variables (input) and the predictions (output, constants) are ordinal categories. The
grammar determines the expressiveness of the rules and the size of the rule space to
be searched. Functions take arguments and they form subtrees. In this paper, we take
{if-then-else, and, or, not, >, <, =} as functions.
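A decision tree over this function set can be illustrated with a short Python sketch. Everything specific in it is an assumption made for illustration: the nested-tuple encoding, the variable names `expert1` and `expert2`, the example rule, and in particular the ordering bearish < sluggish < bullish used for the comparison functions (the category "uncertain" is left out because its rank is not stated).

```python
# Assumed ordering of the ordinal categories, needed for ">" and "<".
ORDER = {"bearish": 0, "sluggish": 1, "bullish": 2}


def evaluate(node, case):
    """Evaluate a decision tree (nested tuples) on one data case (a dict)."""
    if isinstance(node, str):
        # Leaf: a variable is looked up in the case; a constant is returned as is.
        return case.get(node, node)
    op, *args = node
    if op == "if":
        condition, then_branch, else_branch = args
        return evaluate(then_branch if evaluate(condition, case) else else_branch, case)
    if op == "and":
        return evaluate(args[0], case) and evaluate(args[1], case)
    if op == "or":
        return evaluate(args[0], case) or evaluate(args[1], case)
    if op == "not":
        return not evaluate(args[0], case)
    left, right = (ORDER[evaluate(arg, case)] for arg in args)
    if op == ">":
        return left > right
    if op == "<":
        return left < right
    if op == "=":
        return left == right
    raise ValueError(f"unknown function: {op}")


# Hypothetical rule: predict "bullish" if expert 1 is above "sluggish"
# and expert 2 is not "bearish"; otherwise predict "sluggish".
rule = ("if",
        ("and", (">", "expert1", "sluggish"), ("not", ("=", "expert2", "bearish"))),
        "bullish",
        "sluggish")
prediction = evaluate(rule, {"expert1": "bullish", "expert2": "sluggish"})
```

Both the inputs and the output of such a tree are ordinal categories, which is the property the representation above relies on.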
2.3 Experimental Details

Being a genetic programming system, FGP needs a suitable fitness function that 
measures the predictability of each decision tree. One fundamental measure of pre- 
dictability is the rate of correctness (RC) - the proportion of correct predictions out of 
all predictions: 

$$\mathrm{RC} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$

In the experiments described below, the crossover rate is 90% and the mutation rate is 1%.
Elitism is employed by randomly picking 1% of the population, biased towards the
fitter individuals, and putting them directly into the next generation. Among the existing
selection methods in GP, we used tournament selection with the tournament size set to 4.
The population size is set to 1,200. The termination condition is 40 generations or two
hours, whichever is reached first. Initial GDTs (genetic decision trees) are limited to a
depth of 5. The maximum depth of any tree is set to 17. FGP-1 was implemented in
Borland C++ (version 4.5). All experiments described in this paper were run on a
Pentium PC (200 MHz) running Windows 95 with 64 MB RAM.
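Tournament selection with a tournament size of 4 can be sketched as follows. This is a generic textbook version in Python, not FGP's own code: the function name, the toy fitness and the choice of sampling contenders without replacement are assumptions of the sketch.

```python
import random


def tournament_select(population, fitness, rng, size=4):
    """Draw `size` distinct individuals at random and return the fittest one."""
    contenders = rng.sample(population, size)
    return max(contenders, key=fitness)


rng = random.Random(42)
population = list(range(1200))                   # stand-ins for 1,200 decision trees
fitness = lambda individual: individual / 1200   # toy fitness, not RC
parent_a = tournament_select(population, fitness, rng)
parent_b = tournament_select(population, fitness, rng)
```

A small tournament keeps selection pressure moderate: a weak individual can still become a parent whenever it happens to meet three weaker contenders.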



3 Application of FGP to the Hong Kong Stock Market 

FGP was applied to the prediction of changes in the Hang Seng Index in the Hong
Kong Stock Market. We used the data set given in the appendix of [6], which
comprises 103 data cases, each consisting of nine expert predictions for the following
week and the actual market change. Predictions by each of the nine experts fall into
four categories, which Fan et al. labeled as:

1. bullish, defined as “the index rises by over 1.3% in the next week”; 

2. bearish, defined as “the index falls by over 1.3% in the next week”; 

3. sluggish, defined as “the index is neither bullish nor bearish”; and 

4. uncertain, which means the expert did not make a prediction. 

The period under this study was from 25 May 1991 to 16 October 1993. 

Fan et al. [6] used the "leave-one-out cross-validation strategy" to assess the
forecasting accuracy. This means that, to generate a forecast for time t, all but the experts'
predictions at time t were used to generate a combined prediction. Predictions
generated this way were evaluated. For simplicity and without loss of generality, we used
3-fold cross-validation to estimate FGP's forecasting performance: we partitioned the data
set into three mutually exclusive subsets (the folds):

D1: 34 data cases from 25 May 1991 to 11 January 1992;

D2: 35 data cases from 18 January 1992 to 5 December 1992;

D3: 34 data cases from 12 December 1992 to 16 October 1993.

Each of these data sets was used as the testing data set once, whilst the remaining two
sets were employed as the training data set. The mean forecasting accuracy was the
overall number of correct forecasts divided by the number of cases in the whole data set
[10]. For each of D1, D2 and D3, we ran FGP 10 times, so a total of 30 runs were used
in our experiments.
FGP-1 achieved an average RC of 60.88%, 45.14% and 45.29% over D1, D2 and D3,
respectively. The mean RC of the FGP method was 50.39%, which is comparable with (if
slightly better than) the Multinomial Logit Method (MNL, 50.16%) and the Linear
Programming Method (LP, 45.63%) presented in [6]. The best expert prediction input
(Expert 7) achieved an RC of 43.69%. It was encouraging to see that MNL, LP and
FGP can all improve on the accuracy of the best expert's forecast. However, this
example involves a relatively small number of data cases, and therefore one should not
generalize the results without further experimentation.
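The reported mean RC follows from the per-fold figures when each fold is weighted by its number of cases, as the definition of the mean forecasting accuracy requires. The short Python check below uses only the numbers quoted above; the variable names are illustrative.

```python
# (number of cases, average RC) for each fold, as reported in the text
folds = {"D1": (34, 0.6088), "D2": (35, 0.4514), "D3": (34, 0.4529)}

total_cases = sum(n for n, _ in folds.values())
expected_correct = sum(n * rc for n, rc in folds.values())
mean_rc = expected_correct / total_cases   # weighted by fold size

print(f"{total_cases} cases, mean RC = {mean_rc:.4f}")
```

The weighted mean rounds to 0.5039, in agreement with the 50.39% stated for the FGP method.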



4 Application of FGP to the S&P 500 Index 

Encouraged by FGP's promising forecasting performance on the Hang Seng Index,
we tested FGP on the S&P 500 daily index. Available to us were data from 2 April
1963 to 25 January 1974 (2,700 data cases). Our goal was to see whether FGP could
improve forecasting accuracy on textbook-type predictions.

Six technical rules (three different types) derived from the financial literature [2, 5, 
15] are used as input to FGP-1. They were used to predict whether the following goal 
is achievable at any given day: 

G: the index will rise by 4% or more within 63 trading days (3 months). 

The six technical rules we used were as follows: 

• Two Moving Average Rules (MV): 

The L-days simple moving average at time t, SMV(L, t), is defined as the average 
price of the last L days from time t. The rule is “if today’s index price is greater 
than SMV(L, t), then buy; else do not buy." L = 12 and L = 50 were used. 

• Two Trading Range Break Rules (TRB): 

The rule is: "buy if today’s price is greater than the maximum of the prices in the 
previous L days; else do not buy". L = 5 and L = 50 were used. 

• Two Filter Rules: 

This rule is "buy when the price rises by y percent above its minimum of the prices 
in the previous L days; else do not buy." Two rules, with y = 1(%) and L = 5 and 
L = 10 were used. 
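The three rule types can be written down compactly in Python. The sketch below is a reading of the definitions above, not the authors' code: it assumes that the moving-average window includes today's price, that "the previous L days" excludes today, and that `t >= L` so that every window is complete.

```python
def moving_average_rule(prices, t, L):
    """MV: buy if today's price is greater than the L-day simple moving average."""
    window = prices[t - L + 1 : t + 1]   # assumption: the window includes day t
    return prices[t] > sum(window) / L


def trading_range_break_rule(prices, t, L):
    """TRB: buy if today's price exceeds the maximum of the previous L days."""
    return prices[t] > max(prices[t - L : t])


def filter_rule(prices, t, L, y=1.0):
    """Filter: buy if today's price is more than y percent above the minimum
    of the previous L days."""
    return prices[t] > min(prices[t - L : t]) * (1 + y / 100)


# Made-up price series for illustration only.
rising = [100, 101, 102, 103, 104, 110]
falling = [110, 104, 103, 102, 101, 100]
signals_rising = [rule(rising, 5, 5) for rule in
                  (moving_average_rule, trading_range_break_rule, filter_rule)]
signals_falling = [rule(falling, 5, 5) for rule in
                   (moving_average_rule, trading_range_break_rule, filter_rule)]
```

Each rule returns a Boolean that maps directly onto the two ordinal categories {buy, not-buy} used as FGP inputs.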

Our sole concern is whether FGP can combine technical rules in order to generate
more accurate forecasts. Therefore, the quality of the individual rules is not crucial
to our study.

The FGP algorithm is the same as that in the first example. In addition to the rate 
of correctness (RC), we added two factors to the fitness function: the rate of missing 
chance (RMC) and the rate of failure (RF). RMC and RF are defined as follows: 

$$\mathrm{RMC} = \frac{\text{number of erroneous not-buy signals}}{\text{total number of opportunities}}$$

$$\mathrm{RF} = \frac{\text{number of erroneous buy signals}}{\text{total number of buy signals}}$$
Weights were given to RC, RMC and RF in the fitness function. By adjusting these 
weights, we can reflect the preference of investors. For example, a conservative in- 
vestor may want to avoid failure and consequently put more weight on RF. 
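The three rates and one possible way of combining them can be sketched in Python as follows. The rates follow the definitions above, with an "opportunity" read as a day on which the goal G was actually achievable. The combination rule is an assumption of this sketch: the text only says that weights were given to RC, RMC and RF, so rewarding RC and penalising RMC and RF linearly is a hypothetical choice, and the default weights are the values 1, 0.2 and 0.3 reported for the experiments.

```python
def rates(predicted, actual):
    """Compute (RC, RMC, RF) from Boolean lists.

    predicted[i] is True for a buy signal; actual[i] is True if the goal
    was in fact achievable on day i.
    """
    pairs = list(zip(predicted, actual))
    true_buy = sum(p and a for p, a in pairs)
    false_buy = sum(p and not a for p, a in pairs)        # erroneous buy signals
    missed = sum((not p) and a for p, a in pairs)         # erroneous not-buy signals
    true_not_buy = sum((not p) and (not a) for p, a in pairs)
    rc = (true_buy + true_not_buy) / len(pairs)
    opportunities = true_buy + missed
    buy_signals = true_buy + false_buy
    rmc = missed / opportunities if opportunities else 0.0
    rf = false_buy / buy_signals if buy_signals else 0.0
    return rc, rmc, rf


def fitness(predicted, actual, w_rc=1.0, w_rmc=0.2, w_rf=0.3):
    """Assumed form: reward correctness, penalise missed chances and failures."""
    rc, rmc, rf = rates(predicted, actual)
    return w_rc * rc - w_rmc * rmc - w_rf * rf


predicted = [True, True, False, False]
actual = [True, False, True, False]
score = fitness(predicted, actual)
```

Raising `w_rf` makes erroneous buy signals more costly, which corresponds to the conservative investor mentioned above.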
Table 1. Performance comparisons between individual rules and FGP rules

     Individual rule performances        |       FGP rule performances
Rules          Accuracy (RC)   ARR       | Runs          Accuracy (RC)   ARR
MV (L=12)      0.4956          0.3020    | FGP Rule 1    0.5400          0.3952
MV (L=50)      0.5189          0.2666    | FGP Rule 2    0.5389          0.3945
TRB (L=5)      0.4733          0.3319    | FGP Rule 3    0.5400          0.3952
TRB (L=50)     0.4756          0.2102    | FGP Rule 4    0.5522          0.3911
Filter (L=5)   0.4944          0.3746    | FGP Rule 5    0.5444          0.3964
Filter (L=10)  0.4889          0.3346    | FGP Rule 6    0.5367          0.3935
                                         | FGP Rule 7    0.5389          0.3945
                                         | FGP Rule 8    0.5356          0.3928
                                         | FGP Rule 9    0.5433          0.3960
                                         | FGP Rule 10   0.5300          0.4187
                                         | Mean          0.5400          0.3968



In our experiments, RC, RMC and RF were given weights of 1, 0.2 and 0.3, respectively.
1,800 cases (02/04/1963 to 02/07/1970) were used as training data, and 900 cases
(06/07/1970 to 25/01/1974) were used as test data. We ran FGP 10 times. For each
run, the best rule evolved in training was applied to the test data. The results of the
FGP rules on the test data and of the six individual rules are recorded in Table 1.
Among the six technical rules, the MV (L=50) rule was the best individual rule for
this set of data. It achieved an accuracy of 51.89%. In contrast, even the poorest FGP
rule (FGP Rule 10) achieved an accuracy of 53.00%. The average accuracy of the FGP
rules was 54.00%. So although only 10 decision trees were generated, the results
were conclusive: FGP consistently produced better forecasts by combining individual
decisions.

For reference, we measured the annualised rate of return (ARR) by the rules 
above using the following hypothetical trading behaviour with simplifying assump- 
tions: 

Hypothetical trading behaviour: whenever a buy signal is generated, one 
unit of money is invested in a portfolio reflecting the S&P-500 index. If the 
index rises by 4% or more within the next 63 days, then the portfolio is sold 
at the index price of day t; else sell the portfolio on the 63rd day, regardless 
of the price. 

We ignored transaction costs and the bid-ask spread. Results in Table 1 show that
rules generated by FGP achieved an ARR of 39.68% on average. In comparison, the
best of the input rules (the Filter rule with L=5) achieved an ARR of 37.46%, which is
lower than the poorest ARR generated by FGP in the ten runs (39.11% by Rule 4).
Abstract. The factor model is a very useful and popular model in finance.
In this paper, we show the relation between the factor model and blind source
separation, and we propose to use Independent Component Analysis
(ICA) as a data mining tool to construct the underlying factors and
hence obtain the corresponding sensitivities for the factor model.



1 Introduction 

The factor model is a fundamental model in finance. Many financial theories are
established based on it, for example, Modern Portfolio Theory and Arbitrage
Pricing Theory (APT). These theories assume that the returns of securities are
represented as linear combinations of some factors. Modern Portfolio Theory
aims at analyzing the composition of securities in the portfolio and relates the
return and risk of the portfolio to the security returns and risks [20]. The factor
model serves as an efficient and common model for the return generating process
[21, 24, 17]. Furthermore, the factor model is also the foundation of Arbitrage Pricing
Theory [5, 22]. APT plays an important role in modern finance and
analyses capital asset pricing [9, 10].

The factor model relates the returns of securities to a set of factors. The factors
can be system (market) factors or non-system (individual) factors. Finding the
factors for the model is a challenging task for researchers, as the
factors are hidden and not necessarily directly related to fundamental factors
such as GDP and interest rates [12]. In this paper, we apply independent component
analysis (ICA), a modern signal processing method, to recover the hidden factors
and the corresponding sensitivities. Sections 2 and 3 review the background of
the factor model and ICA. We apply ICA to the factor model in Section 4. Section 5
contains the experiment and results.

2 Factor model in finance 

The multifactor model is a general form of the factor model [2, 9, 21], and is the most
popular model for the return generating process. The return $r_i$ on the $i$th security
is represented as

$$r_i = \alpha_i + \sum_{m=1}^{k} \beta_{im} F_m + u_i \qquad (1)$$

where $k$ is the number of factors (a positive integer),
$F_1, F_2, \ldots, F_k$ are the factors affecting the returns of the $i$th security, and
$\beta_{i1}, \beta_{i2}, \ldots, \beta_{ik}$ are the corresponding sensitivities. $\alpha_i$ is
regarded as the "zero" factor that is invariant with time; $u_i$ is a zero-mean random
variable of the $i$th security. It is generally assumed that the covariances between
$u_i$ and the factors $F_m$ are zero. Also, $u_i$ and $u_j$ for securities $i$ and $j$ are
independent if $i \ne j$.

The simplest factor model is the one-factor model, i.e., $k = 1$. A one-factor model
with the market index as the factor variable is called the market model. However, the
factor model does not restrict the factor to be the market index. Investigators use
different approaches to the factor model [19, 6]. The first assumes that some known
fundamental factors are the factors that influence the security, and the $\beta$'s are
evaluated accordingly. The second approach assumes that the sensitivities to factors are
known, and the factors are estimated from the security returns [12]. The third
approach is factor analysis, which assumes that neither the factor values nor the
security sensitivities are known. Under the factor analysis approach, principal component
analysis (PCA) was the most successful method [11, 23, 25]. PCA was used to
find the factors and their sensitivities [2, 8]. However, it was also shown that the
separated factors are not able to truly reflect the real case: only one meaningful
factor, which corresponds to the market effect, is extracted. This is due
to two limitations of PCA. First, the separated principal components must be
orthogonal to each other. Second, PCA uses only up to second-order statistics,
i.e. the covariance and correlation matrices. In this paper, we apply ICA to the factor
model because ICA does not have these limitations. More importantly,
ICA is able to reflect the underlying structures of securities [1, 18].

3 Independent Component Analysis 

Blind source separation (BSS), a well-known problem, aims at recovering the
sources from a set of observations. Applications include separating individual
voices at a cocktail party. The BSS problem involves two processes:
the mixing process and the demixing process. First, we observe a set of multivariate
signals $x_i$, $i = 1, 2, \ldots, n$, that are assumed to be linear mixtures of a set of
source signals. The mixing process is hidden, so we can only observe the mixed
signals. The task is to recover the original source signals from the observations
through a demixing process. Equations 2 and 3 describe the mixing and demixing
processes mathematically.

Mixing: $x = As$ (2)

Demixing: $y = Wx$ (3)

Each signal $x_i$ is a series of $t$ time steps, i.e. $x_i = [x_i(1), x_i(2), \ldots, x_i(t)]$;
$x$ is the $[n \times t]$ observation matrix, i.e. $x = [x_1, x_2, \ldots, x_n]'$. In the BSS
problem, we assume that the number of observations is equal to the number of source
signals. Matrix $s$ contains the original source signals driving the observations, whereas
the separated signals are stored in matrix $y$. They are both $[n \times t]$ matrices. $A$ and
$W$ are both $[n \times n]$ matrices, called the mixing and demixing matrices respectively.
If the separated signals are the same as the original sources, the mixing matrix is the
inverse of the demixing matrix, i.e. $A = W^{-1}$.
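Equations 2 and 3 can be illustrated numerically with a small Python sketch for $n = 2$ and $t = 4$. The source values and the mixing matrix are made up, and the sketch deliberately cheats by computing $W$ as the inverse of a known $A$; in a real BSS problem $A$ is hidden, and an ICA algorithm can only estimate $W$ up to the order and scale of the recovered signals.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def inverse_2x2(m):
    """Invert a 2 x 2 matrix (assumes a non-zero determinant)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


s = [[1.0, -1.0, 0.5, 2.0],    # source signal 1 over t = 4 time steps
     [0.0, 3.0, -2.0, 1.0]]    # source signal 2
A = [[2.0, 1.0],
     [1.0, 3.0]]               # made-up [n x n] mixing matrix

x = matmul(A, s)               # mixing, equation (2): x = A s
W = inverse_2x2(A)             # ideal demixing matrix, W = A^{-1}
y = matmul(W, x)               # demixing, equation (3): y = W x
```

Here `y` reproduces `s` up to floating-point error, which is the case $A = W^{-1}$ described above.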

BSS is a difficult task because we do not have any information about the
sources and the mixing process. ICA tackles this problem by assuming
that the sources are independent of each other [16], and finds the demixing
matrix $W$ and the corresponding independent signals $y$ from the observations $x$ using
some criterion that makes the separated signals as independent as possible. Various
ICA algorithms have been proposed. Most of them use higher-order statistics to
obtain the independent components, e.g. [13, 7, 15, 14, 3] and [4].

4 ICA and Factor model 

4.1 Relationships between BSS and Factor Model 

Previous work has used ICA to extract components for stocks
[1]. However, the independent components have never been related to factor
models. By relating the independent components to the factor model, we hope
that this technique can be used in future applications of the factor model. In
this section, we illustrate the application of ICA to the factor model. Both of them
assume that the observations are driven by a set of factors (or sources). We
first make the return zero mean:

$$r_i - E[r_i] = \sum_{m=1}^{k} \beta_{im} (F_m - E[F_m]) + u_i \qquad (4)$$

We put $R_i = r_i - E[r_i]$ and $F'_m = F_m - E[F_m]$. Without loss of generality, we
treat the noise term, $u_i$, as an extra factor, i.e. $u_i = \beta_{i,k+1} F'_{k+1}$, so that

$$R_i = \sum_{m=1}^{k+1} \beta_{im} F'_m \qquad (5)$$

The above is a typical mixing process of observations in the blind source separation
problem. The factor model is thus transformed into a mixing matrix and factor
series. After the transformation, we can apply ICA to separate the sources (or
factors).

4.2 Procedures of finding factors by ICA 

Here we show the procedure for finding the factors of the factor model using ICA.

1. Select securities' price series as observations. We transform the security
prices $p_i(t)$ to returns, i.e. $r_i(t) = (p_i(t) - p_i(t-1)) / p_i(t-1)$, and make the
return series zero mean, i.e. $R_i = r_i - E[r_i]$.

2. Perform independent component separation on the return series $R_i$.

3. Sort the independent signals by their importance. The importance of a signal
can be measured by its $L_\infty$ norm [1].

4. Select the number of independent signals according to the requirements of the
factor model. The rest of the separated signals are regarded as residuals.

5. Evaluate the sensitivities to the factors using the mixing matrix.

The separated independent signals and the corresponding sensitivities are 
obtained from the above procedures. Hence the factor model is constructed using 
the observable security movements. We will demonstrate this in the experiment. 
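Step 1 of the procedure, the preprocessing that turns prices into zero-mean returns, can be sketched in Python as follows. The price values are made up, the function names are illustrative, and using the sample mean in place of $E[r_i]$ is an assumption of the sketch.

```python
def to_returns(prices):
    """r(t) = (p(t) - p(t-1)) / p(t-1) for t = 1, ..., len(prices) - 1."""
    return [(prices[t] - prices[t - 1]) / prices[t - 1]
            for t in range(1, len(prices))]


def zero_mean(series):
    """Subtract the sample mean; return the centred series and the mean."""
    mean = sum(series) / len(series)
    return [value - mean for value in series], mean


prices = [100.0, 110.0, 99.0, 108.9]       # made-up closing prices
r = to_returns(prices)                     # approximately 0.1, -0.1, 0.1
R, mean_r = zero_mean(r)                   # R is the input handed to ICA
```

Keeping `mean_r` is useful later, because the expected return has to be added back to recover the original level of the returns from the zero-mean factor model.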

4.3 Remarks of applying ICA to find the factors 

From the above, the expected return of each security, $E[r_i]$, is equal to the sum of the
factor means (weighted by their sensitivities) and the zero factor. There is no
information about the zero factor given to the ICA algorithm during decomposition,
because we cancelled the zero factor when subtracting the mean of each observation
signal as in Equation 4. As a result, we cannot separate the zero factor from the
observations.^ However, we can retain the original pricing level of each security by
adding its expected value $E[r_i]$ to the factor model.

5 Experiments and Results 

In the experiment, we used 7 stocks selected from the Hang Seng Index constituents.
Daily closing prices from 2/1/1992 to 23/8/2000 were used (Figure 1).

We reconstruct the multifactor model of each stock using
the procedure in Section 4.2. Figure 2 shows the separated signals. From
top to bottom, the topmost signal is the most important hidden factor, $F'_1$,
and so on; the last signal is named $F'_7$.

We reconstruct the factor models with six hidden factors, $F'_1, F'_2, \ldots, F'_6$, where
the least important factor $F'_7$ is regarded as the residual. The mixing matrix found
is shown below.



$$A = \begin{pmatrix}
0.0145 & -0.0119 & -0.0034 & 0.0055 & 0.0027 & 0.0138 & -0.0059 \\
0.0071 & -0.0169 & -0.0009 & 0.0067 & 0.0019 & -0.0016 & -0.0018 \\
0.0072 & -0.0137 & -0.0014 & 0.0001 & 0.0154 & 0.0031 & -0.0048 \\
0.0095 & -0.0137 & -0.0195 & 0.0016 & 0.0041 & 0.0053 & -0.0051 \\
0.0056 & -0.0180 & -0.0014 & -0.0022 & -0.0002 & 0.0122 & -0.0117 \\
0.0166 & -0.0105 & -0.0070 & 0.0035 & 0.0038 & 0.0020 & -0.0154 \\
0.0222 & -0.0158 & -0.0058 & -0.0085 & 0.0014 & 0.0037 & -0.0016
\end{pmatrix}$$



The rows of the mixing matrix are the sensitivities of the corresponding stock to the
hidden factors. To reconstruct the factor model, we take stock 1 as an
example. Equations 6 and 7 show its return expressed as a 6-factor model and a
3-factor model, respectively.

$$R_1(t) = 0.0145 \times F'_1(t) - 0.0119 \times F'_2(t) - 0.0034 \times F'_3(t) + 0.0055 \times F'_4(t) + 0.0027 \times F'_5(t) + 0.0138 \times F'_6(t) + w_1 \qquad (6)$$

where $w_1 = -0.0059 \times F'_7(t)$.

$$R_1(t) = 0.0145 \times F'_1(t) - 0.0119 \times F'_2(t) - 0.0034 \times F'_3(t) + v_1 \qquad (7)$$

where $v_1 = 0.0055 \times F'_4(t) + 0.0027 \times F'_5(t) + 0.0138 \times F'_6(t) - 0.0059 \times F'_7(t)$.

To express the return as in the factor model, we simply add the expected return
to $R_i$: $r_i = R_i + E[r_i]$.

^ It is also a common practice to assume that the expected values of the factors are zero.
In that case, the zero factor can be obtained.

Fig. 1. Seven stocks' series in the experiment (panel title: stocks signals as the observations)
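The split of one stock's return into a $k$-factor part and a residual, as in Equations 6 and 7, can be sketched in Python. The sensitivities are the first row of the mixing matrix reported above; the factor values at time $t$ are made up for illustration, and the function name is hypothetical.

```python
# Sensitivities of stock 1: the first row of the reported mixing matrix.
row_1 = [0.0145, -0.0119, -0.0034, 0.0055, 0.0027, 0.0138, -0.0059]


def split_factor_model(sensitivities, factor_values, k):
    """Return (explained part from the first k factors, residual from the rest)."""
    terms = [beta * f for beta, f in zip(sensitivities, factor_values)]
    return sum(terms[:k]), sum(terms[k:])


# Made-up values of F'_1(t), ..., F'_7(t) at a single time step t.
factors_t = [1.0, -0.5, 2.0, 0.0, 1.0, -1.0, 0.5]

explained_6, w_1 = split_factor_model(row_1, factors_t, k=6)   # as in equation (6)
explained_3, v_1 = split_factor_model(row_1, factors_t, k=3)   # as in equation (7)
R_1 = explained_6 + w_1        # zero-mean return of stock 1 at time t
```

Whatever value of `k` is chosen, the explained part and the residual always add up to the same `R_1`; only the boundary between "factor" and "residual" moves.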

6 Discussions and Conclusion 

Fig. 2. Factor signals separated by ICA. The separated signals are sorted by their
importance. (The y-axes of the sub-figures do not have equal scales.) The uppermost
signal is regarded as the most important signal and so on.

In this paper, we propose to apply independent component analysis (ICA) to
extract the factors and the sensitivities of securities in the factor model. In some
traditional applications of factor models, the returns are related to some
systematic factors or macro-economic variables, for example, unexpected changes
in the rate of inflation and the rate of return on a treasury bill. On one hand,
it is useful to know what the exact underlying factors are. On the other hand,
the financial market nowadays is extremely complex and dynamic. Especially with
globalization and many newly introduced indices, such as IT indices, it is
not an easy task to decide which variables, among so many systematic factors
and macro-economic variables, should be included in the model as factors. Our
method serves as a data mining technique to automatically identify the hidden
factors from historical data. Although attempts can be made to correlate the
extracted factors with known variables, the factor models can still be applied
in many areas of finance without such an interpretation. For example, we can
perform risk analysis
and construct portfolios which are less sensitive to the hidden factors.

Abstract. Data mining or knowledge discovery in database (KDD) is
motivated by large amounts of computerized data and has attracted
a lot of interest in various areas. One area is the extraction of useful and
predictive information from huge financial databases so that investors can
be better informed and make more profitable investments. The efficiency
of the information extraction has become the main concern when
performing the extraction process. In this paper, we demonstrate how to
apply conceptual clustering (a hierarchical clustering algorithm), a data
mining technique, to the Chinese and Hong Kong stock market’s data.
A conceptual hierarchical tree and a cluster information table are
generated to give concepts to the clusters for further analysis in the
subsequent mining process.



1 Introduction 

Because of the explosive growth of many business databases, people are
interested in extracting useful and predictive information from massive databases,
especially from stock market databases. This has created a need for a new
generation of automated and intelligent database analysis tools and techniques
for Knowledge Discovery in Database.

Knowledge Discovery in Database (KDD) is the overall process of discovering 
useful knowledge from databases including data preparation, data selection, pre- 
processing, transformation, mining process and evaluation of the mining results. 
Data Mining, which is also referred to as knowledge discovery in databases, is 
the process of extracting previously unknown, valid and actionable information 
from large databases and then using the information to make crucial business 
decisions. 

Clustering is one type of data mining technique. We examine cluster
analysis for the extraction of information from raw financial data. We adopt
the hierarchical/conceptual clustering technique proposed by Hu [1] to perform
cluster analysis on the Chinese and Hong Kong stock market’s data,
to construct a hierarchical structure of the data and to infer useful knowledge rules
based simply on the containment relationship between different clusters.

K.S. Leung, L.-W. Chan, and H. Meng (Eds.): IDEAL 2000, LNCS 1983, pp. 545-550, 2000.

2 Cluster Analysis

Clustering analysis is the process of grouping physical or abstract objects into
classes of similar objects. It helps to construct a meaningful partitioning of a large
set of objects based on a ’divide and conquer’ methodology, which decomposes a
large-scale system into smaller components to simplify design and
implementation [2].

Clustering analysis is always the first step for analyzing data in the KDD process.
It is done to discover subsets of related objects, and to find descriptions such as
D1, D2, D3, etc. that describe each of these subsets, as depicted in Figure 1.





Fig. 1. Discovering clusters and descriptions in a database (constructing the descriptions D1, D2, D3)



During the cluster analysis process, objects are grouped on the basis of
similarities (associations) and distances (dissimilarities) to form the clusters.
Therefore, the number of groups does not need to be pre-defined.

In this paper, conceptual clustering method proposed in [1] is adopted to 
develop the analysis application for financial statement data. This method is 
very suitable for clustering the object classes in a very large database efficiently 
based on similarity measure that maximizes the cohesiveness (a reciprocal of the 
conceptual distance) of the clusters. 

First, we cluster the data using numerical taxonomy; then we extract a
characteristic feature for each cluster; finally, we treat each cluster as a positive
sample to derive knowledge rules. The algorithm aggregates objects into different
clusters first and then assigns conceptual descriptions to object classes. The data
are pre-processed before clustering. The clustering procedure then proceeds as
follows:

1. Calculate the common attribute values between each pair of data in database 

2. Delete the cluster with common attribute values less than the pre-defined 
threshold value. 

3. Using single-linkage method, aggregate data to form a cluster 

4. If new cluster is produced, continue the process; otherwise, terminate the 
process 

5. Form the hierarchy based on the newly formed or untouched clusters and
use these clusters for the next iteration.

The following describes the steps of the single-linkage nearest neighbor method:

1. If the common attribute table has more than one row, go to Step 2; otherwise
terminate the process.

2. Locate the maximum common attribute value (the nearest distance)
in the common attribute table.

3. Locate the two elements that have this maximum value, Ei and Ej, combine
them as Eij, and form the new common attribute table (i.e., the
distance between the new cluster and another element Ck is computed as
min[Dik, Djk]).

4. Calculate the common attribute of the new cluster.

5. After forming the new attribute table, go back to Step 1.

The output is a cluster hierarchy of the data set represented in the form
of a Concept Hierarchical Tree [3] and a cluster information table.
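
The single-linkage steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: records are tuples of categorical values, the common attribute value of a pair is the number of positions on which they agree, and the pre-defined threshold is used as the stopping rule. Because the table stores similarities rather than distances, taking the minimum distance min[Dik, Djk] corresponds to taking the maximum common attribute value. All function names are hypothetical.

```python
from itertools import combinations


def common_attributes(a, b):
    """Number of attribute positions on which two records agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def single_linkage(records, threshold):
    """Merge the most similar clusters first until no pair reaches threshold.

    Returns (merges, clusters): merges lists (member_indices, common_count)
    in merge order; clusters lists the member indices of the final clusters.
    """
    clusters = {i: (i,) for i in range(len(records))}
    # Common attribute table, keyed by the pair of cluster ids.
    table = {
        frozenset((i, j)): common_attributes(records[i], records[j])
        for i, j in combinations(range(len(records)), 2)
    }
    merges = []
    next_id = len(records)
    while len(clusters) > 1:
        pair, best = max(table.items(), key=lambda item: item[1])
        if best < threshold:
            break
        i, j = sorted(pair)
        merged = clusters.pop(i) + clusters.pop(j)
        # Keep the rows that do not involve the two merged clusters.
        new_table = {key: value for key, value in table.items() if not key & pair}
        # Single linkage: the new cluster is as close to k as its closest part.
        for k in clusters:
            new_table[frozenset((next_id, k))] = max(
                table[frozenset((i, k))], table[frozenset((j, k))]
            )
        clusters[next_id] = merged
        table = new_table
        merges.append((merged, best))
        next_id += 1
    return merges, list(clusters.values())
```

The list of merges, read in order, gives the levels of the hierarchy from the most cohesive clusters upward.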

3 System Overview and Design 

The clustering analysis system for the Chinese and Hong Kong stock market’s
data is a web-based, object-oriented application running over the Internet. One web
server and one database server are able to serve multiple users simultaneously.
The system follows a three-tier design:

1. User selects the appropriate criteria for cluster analysis application on client. 

2. The client (web browser) sends a request with the criteria object to the 
server via HTTP. 

3. The web server receives the HTTP request and forwards it to the backend Java
servlet program.

4. The Java servlet program connects to the financial statement database
and performs clustering to extract useful information from the database.

5. After the clustering process, the servlet program sends the result
object back to the web server, which forwards it to the client.




Fig. 2. Three-tier design of the system

Fig. 3. Architecture of the system



4 Implementation 

The cluster analysis system is operated in three steps:

1. Select financial data (e.g. financial period, industry field, and financial Ratio 
fields) for clustering from the main screen of the system (see figure 4) 

2. Start the clustering process. 

3. View the clustering results. Two views of the results are provided: the hierarchical
tree view (figure 5) and the cluster group information table (figure 6).



5 Experimental Results 

The following table shows the system performance according to different records 
sizes based on eight attributes: 



Attribute no    Record Sets no    Time (sec)    Time (mins)
8               50                900           15
8               100               1800          30
8               200               2400          60
8               500               9000          150


6 Conclusion 

Fig. 4. Main Screen

Fig. 5. Hierarchical Tree View

Fig. 6. Cluster group information

In this paper, we adopt conceptual clustering methods, a data mining
technique, to analyze the consolidated Chinese and Hong Kong stock market’s data.
We adopt the attribute-oriented concept tree ascending technique and integrate
database operations with the learning process to form the concept hierarchical
tree and its corresponding cluster information. The clustering result takes the form of
the conceptual hierarchical tree and cluster information table, which investors can use to
perform further analysis. The web-based, object-oriented, database-backed
design of the financial application goes beyond the traditional demographic
problems handled by clustering methods and makes the clustering process fully
accessible by users through the Internet.

Abstract. This paper studies asset pricing in arbitrage-free financial
markets in general state space. The mathematical formulation is based
on a locally convex topological space for weakly arbitrage-free securities’
structure and a separable Banach space for strictly arbitrage-free
securities’ structure. We establish, for these two types of spaces, the weakly
arbitrage-free pricing theorem and the strictly arbitrage-free pricing
theorem, respectively.



1 Introduction 

We consider arbitrage-free asset pricing in a setting of general state spaces, 
in particular, locally convex topological space for weakly arbitrage-free security 
markets, and separable Banach space for strictly arbitrage-free security markets. 

Arbitrage-free conditions have been an important first step in the study of
general equilibrium theorems with incomplete asset markets (Duffie & Shafer
1985; Geanakoplos 1990; Geanakoplos & Shafer 1990; Hirsch, Magill &
Mas-Colell 1990; Husseini, Lasry & Magill 1990; and Magill & Shafer 1991). Since
the 1980s, for finite period economies, arbitrage-free pricing theory has been
applied by various authors to prove the existence of general equilibrium for
stochastic economies with incomplete financial markets (Duffie 1987, 1988, 1996;
Florenzano & Gourdel 1994; Magill & Shafer 1991; Werner 1985, 1990; and Zhang
1998). In those works, a finite number of possible states of nature and a
finite-dimensional commodity space are usually assumed in order for the proofs to be
carried out for the general equilibrium model with incomplete financial markets.

Usually Stiemke’s Lemma, a strict version of Farkas-Minkowski’s Lemma, is
applied to study the asset pricing theory with arbitrage-free conditions. As
examples, this approach is taken in discrete-time models of dynamic asset pricing
theory (Duffie 1988, 1996) and the theory of economic equilibrium with incomplete
asset markets (Geanakoplos 1990; Geanakoplos & Shafer 1990; Hirsch, Magill
& Mas-Colell 1990; Husseini, Lasry & Magill 1990; and Magill & Shafer 1991)
where the commodity space is of finite dimension. Farkas-Minkowski’s Lemma
and Stiemke’s Lemma are in essence the mathematical counterpart of the
asset pricing theory with arbitrage-free conditions. For the general state space
model in our discussion, we obtain extensions of Farkas-Minkowski’s Lemma and
Stiemke’s Lemma by applying Clark’s separating hyperplane theorems (Clark
1993, 1994), and thus establish our main results.

Harrison & Kreps (1979) initiated the study of martingales and arbitrage
in multiperiod security markets. They first introduced a general theory of
arbitrage in a two-period economy with uncertainty, then extended it to models
of multiperiod security markets and models of continuous-time securities
markets. Kreps (1981) studied arbitrage and equilibrium in economies with
infinitely many commodities and presented an abstract analysis of “arbitrage” in
economies that have an infinite dimensional commodity space. Harrison & Pliska
(1981) studied martingales and stochastic integrals in the theory of continuous
trading. Dalang, Morton & Willinger (1990) studied equivalent martingale
measures and no-arbitrage in stochastic securities market models. Back & Pliska
(1991) studied the fundamental theorem of asset pricing with an infinite state
space and showed some equivalent relations on arbitrage. Jacod & Shiryaev
(1998) studied local martingales and the fundamental asset pricing theorems
in the discrete-time case. These papers studied fundamental theorems of asset
pricing in multiperiod financial models with the help of techniques from
stochastic analysis. Our work is based on separating hyperplane theorems and does not
rely on the assumptions that the above models need in order to carry out the
stochastic analysis.

Friction in markets has recently attracted the attention of several works in this
field. Chen (1995) examined the incentives and economic roles of financial
innovation and at the same time studied the effectiveness of the replication-based
arbitrage valuation approach in frictional economies (the friction means holding 
constraints). Jouini & Kallal (1995a) derived the implications of the absence 
of arbitrage in securities markets models where traded securities are subject to 
short-sales constraints and where the borrowing and lending rates differ, and 
showed that a securities price system is arbitrage free if and only if there exists 
a numeraire and an equivalent probability measure for which the normalized 
(by the numeraire) price processes of traded securities are supermartingales. 
Jouini & Kallal (1995b) derived the implications from the absence of arbitrage 
in dynamic securities markets with bid-ask spreads. The absence of arbitrage 
is equivalent to the existence of at least an equivalent probability measure that 
transforms some process between the bid and the ask price processes of traded 
securities into a martingale. Pham & Touzi (1999) addressed the problem of 
characterization of no arbitrage (strictly arbitrage-free) in the presence of fric- 
tion in a discrete-time financial model, and extended the fundamental theorem 
of asset pricing under a non-degeneracy assumption. The friction is described by 
the transaction cost rates for purchasing and selling the securities. Deng, Li &
Wang (2000) studied the computational aspect of arbitrage in frictional markets
including integrality constraints. We follow the model for transaction costs from
Pham & Touzi (1999), and then extend the first fundamental valuation theorems
of asset pricing from frictionless security markets to frictional security markets,
for general state space. Again, their stochastic analysis method requires stronger
assumptions than ours. In addition, their arbitrage-free conditions are slightly
different from ours, which are based on those of Duffie (1988).

In Section 2, we first present our model in frictionless markets. Follow- 
ing Duffie (1988), we define the concepts of weakly arbitrage-free and strictly 
arbitrage-free for our model. Then, we establish the first fundamental valua- 
tion theorem of asset pricing (a necessary and sufficient condition for arbitrage- 
freeness) with weakly arbitrage-free security markets and strictly arbitrage-free 
security markets. In Section 3, we extend our work to markets with transaction 
costs, following the model of Pham & Touzi (1999). 

2 Frictionless Security Markets 

We consider a two-period model (dates 0 and 1) with uncertainty over the states
of nature at date 1. The unknown nature of the future is represented by
a general set $\Omega$ of possible states of nature, one of which will be revealed as
true. Here we make no assumption about the probability of these states. The $J$
securities are given by a return “matrix” $V = (V^1, \dots, V^J)$, where $V^j$ denotes
the number of units of account paid by security $j = 1, \dots, J$. Let $q \in \mathbb{R}^J$ denote
the vector of prices of the $J$ securities. A portfolio $\theta \in \mathbb{R}^J$ has market value $q^\top\theta$
and payoff $V\theta$.

Let $T$ be a topological space consisting of processes in $\mathbb{R}^\Omega$, and let $T_+$ be the positive
cone of $T$. Let $T^*$ be the dual space composed of the continuous linear functionals
on $T$, $T^*_+$ the positive cone of the space $T^*$ (the space of all positive continuous
linear functionals on $T$) and $T^*_{++}$ the interior of the cone $T^*_+$ (the space of all
strictly positive continuous linear functionals on $T$): $T^*_{++} \subset T^*_+ \subset T^*$.

In this paper, we assume $V^j \in T$ for $j = 1, \dots, J$. Then $V\theta = \sum_{j=1}^{J} V^j\theta^j \in T$.
Our proofs adopt the following notation:

$$\langle V \rangle = \{ V\theta \in T \mid \theta \in \mathbb{R}^J \} \quad \text{and} \quad \langle V \rangle_+ = \langle V \rangle \cap T_+ .$$

Definition 1 The frictionless market $(q, V)$ is weakly arbitrage-free if any
portfolio $\theta \in \mathbb{R}^J$ of securities has a positive market value $q^\top\theta \ge 0$ whenever it
has a positive payoff $V\theta \in T_+$.

Definition 2 The frictionless market $(q, V)$ is strictly arbitrage-free if (1)
any portfolio $\theta \in \mathbb{R}^J$ of securities has a strictly positive market value $q^\top\theta > 0$
whenever it has a positive non-zero payoff $V\theta \in T_+ \setminus \{0\}$; and (2) any portfolio
$\theta \in \mathbb{R}^J$ of securities has a zero market value $q^\top\theta = 0$ whenever it has a zero
payoff $V\theta = 0$.

Definition 2 obviously implies Definition 1. We follow the definition of an
arbitrage opportunity provided by Duffie (1988, 1996). An arbitrage is a portfolio
$\theta \in \mathbb{R}^J$ with either (1) $q^\top\theta \le 0$ and $V\theta \in T_+ \setminus \{0\}$, or (2) $q^\top\theta < 0$ and $V\theta \in T_+$.
That is to say, the frictionless market $(q, V)$ admits an arbitrage opportunity if
there exists a portfolio $\theta \in \mathbb{R}^J$ of securities satisfying (1) or (2).
Consequently, we can define the
strictly arbitrage-free (no arbitrage) frictionless market $(q, V)$ as follows.

Definition 2' The frictionless market $(q, V)$ is strictly arbitrage-free if (1)
any portfolio $\theta \in \mathbb{R}^J$ of securities has a strictly positive market value $q^\top\theta > 0$
whenever it has a positive non-zero payoff $V\theta \in T_+ \setminus \{0\}$; and (2) any portfolio
$\theta \in \mathbb{R}^J$ of securities has a positive market value $q^\top\theta \ge 0$ whenever it has a
positive payoff $V\theta \in T_+$.

Lemma 1 Definitions 2 and 2' are equivalent. 

An arbitrage is therefore, in effect, a portfolio offering “something for
nothing”. Not surprisingly, an arbitrage is naturally ruled out in reality. This fact
gives a characterization of security prices as follows. A valuation functional is a
functional $v \in T^*_+$ for the weakly arbitrage-free frictionless market $(q, V)$ with
consistency $q^\top = vV$, and a functional $v \in T^*_{++}$ for the strictly arbitrage-free
frictionless market $(q, V)$ with consistency $q^\top = vV$, where $vV = (vV^1, \dots, vV^J)$.
The valuation functional is called a positive linear consistent valuation
operator for the weakly arbitrage-free frictionless market and a strictly positive linear
consistent valuation operator for the strictly arbitrage-free frictionless market,
respectively.

The idea of arbitrage and the absence of arbitrage opportunities is fundamen- 
tal in finance. The strict arbitrage-freeness is important in the study of general 
equilibrium theory with incomplete asset markets (Husseini, Lasry & Magill 
1990; Werner 1990; and Magill & Shafer 1991). Theorem 2 to be presented in 
the following is an important step in the study of equilibrium for economies 
with general state spaces considered in our work. The principal mathematical 
tool applied here is the Separating Hyperplane Theorems of Clark (1993, 1994). 

Fact 1 (Clark 1994) Suppose $M$ and $N$ are non-empty disjoint convex cones in a
locally convex topological vector space $E$. Then there exists a non-zero continuous
linear functional $f : E \to \mathbb{R}$ separating $N$ from $M$: $f(n) \ge 0$ for all $n \in N$ and
$f(m) \le 0$ for all $m \in M$ if and only if $\overline{M - N} \ne E$. Moreover, if $\overline{M - N} \ne E$,
then for any $e \notin \overline{M - N}$ we may select $f$ so that $f(e) > 0$.

Fact 2 (Clark 1993) Suppose $M$ and $N$ are non-empty convex cones (with
vertices at the origin) in a separable Banach space $E$. Then there exists a non-zero
continuous linear functional $f : E \to \mathbb{R}$ strictly separating $N$ from $M$: $f(n) > 0$
for all $n \in N$ and $f(m) \le 0$ for all $m \in M$ if and only if $N \cap \overline{M - N} = \emptyset$.

Facts 1 and 2 will be used to prove Theorems 1 and 2 in Sections 2.1 and 2.2, and
Theorems 3 and 4 in Sections 3.1 and 3.2, respectively. We assume $E = \mathbb{R} \times T$,
which is a topological space; then $E_+ = \mathbb{R}_+ \times T_+$ is the positive cone of $E$,
which is a closed convex cone of $E$ with its vertex at the origin. The
marketed subspace

$$M = \{ (-q^\top\theta, V\theta) \in E \mid \theta \in \mathbb{R}^J \}$$

is a linear subspace of the space $E$.



2.1 Weakly Arbitrage-free Security Valuation Theorem 

In this section, we assume that T is a locally convex topological space. 

Proposition 1 The frictionless market $(q, V)$ is weakly arbitrage-free if and
only if $M \cap E_+ = \{0\} \times \langle V \rangle_+$.

Theorem 1 The frictionless market $(q, V)$ is weakly arbitrage-free if and only
if there exists a positive functional $v \in T^*_+$ satisfying $q^\top = vV$.



2.2 Strictly Arbitrage-free Security Valuation Theorem 

We assume that $T$ is a separable Banach space. We prove Proposition 2 and
Theorem 2 by using Definition 2 of the strictly arbitrage-free frictionless market
$(q, V)$.



Proposition 2 The frictionless market $(q, V)$ is strictly arbitrage-free if and
only if $M$ and $E_+$ intersect precisely at $(0, 0)$, that is, $M \cap E_+ = \{(0, 0)\}$.

Theorem 2 The frictionless market $(q, V)$ is strictly arbitrage-free if and only
if there exists a strictly positive functional $v$ satisfying $q^\top = vV$.
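
The easy direction of Theorems 1 and 2 can be illustrated in a finite-state setting. The sketch below is only an illustration under assumptions that are not part of the paper: the state space has finitely many states, the valuation functional is represented by a vector of state prices, and all names and numbers are hypothetical. It shows the identity $q^\top\theta = v(V\theta)$, from which a strictly positive $v$ forces a strictly positive market value for every positive non-zero payoff.

```python
def prices_from_state_prices(v, V):
    """Consistency q = vV: price of security j is sum over states of v_w * V[w][j]."""
    n_assets = len(V[0])
    return [sum(v[w] * V[w][j] for w in range(len(V))) for j in range(n_assets)]


def payoff(V, theta):
    """Payoff V*theta of a portfolio, one entry per state."""
    return [sum(row[j] * theta[j] for j in range(len(theta))) for row in V]


def market_value(q, theta):
    """Market value q^T theta of a portfolio."""
    return sum(qj * tj for qj, tj in zip(q, theta))


# Hypothetical market: two states (rows), a bond and a stock (columns).
V = [[1.0, 2.0],
     [1.0, 0.5]]
v = [0.4, 0.5]                      # strictly positive state prices (assumed)
q = prices_from_state_prices(v, V)  # prices consistent with v

theta = [1.0, 0.0]                  # hold one bond: payoff is positive in both states
value = market_value(q, theta)
valued_payoff = sum(vw * pw for vw, pw in zip(v, payoff(V, theta)))
```

Since the market value of any portfolio equals the state-price valuation of its payoff, no portfolio with a positive non-zero payoff can be bought at a non-positive price in this example.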



3 Frictional Security Markets 



Suppose that there are transaction costs in trading. The coefficients $b^j \in [0, \infty)$
and $s^j \in [0, 1)$ are, respectively, the transaction cost rates for purchasing
and selling security $j$. Then the algebraic cost induced by (buying) a position of
$\theta^j \ge 0$ units of security $j$ is $q^j(1 + b^j)\theta^j$, and the algebraic gain induced by (selling)
a position of $\theta^j \le 0$ units of security $j$ is $q^j(1 - s^j)\theta^j$. We introduce the functions
$\Phi^j : \mathbb{R} \to \mathbb{R}$ defined by

$$\Phi^j(z) = \begin{cases} q^j(1 + b^j)z, & z \ge 0 \\ q^j(1 - s^j)z, & z < 0 \end{cases}$$

and the functions $\phi^j : \mathbb{R} \to \mathbb{R}$ defined by

$$\phi^j(z) = \begin{cases} (1 + b^j)z, & z \ge 0 \\ (1 - s^j)z, & z < 0. \end{cases}$$

Then $\Phi^j(z) = q^j\phi^j(z)$.

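For any integer $N = 1, 2, \dots$, a function $\psi : \mathbb{R}^N \to \mathbb{R}$ is sublinear if, for any
$x^1 \in \mathbb{R}^N$, $x^2 \in \mathbb{R}^N$, $x \in \mathbb{R}^N$ and $\lambda \in \mathbb{R}_+$,

$$\psi(x^1 + x^2) \le \psi(x^1) + \psi(x^2) \quad \text{and} \quad \psi(\lambda x) = \lambda\psi(x).$$

The function $\phi^j$ is sublinear, and hence convex. Therefore the function $\Phi^j$ is also
sublinear, and hence convex.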

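The total cost or gain induced by (trading) a portfolio $\theta \in \mathbb{R}^J$ is
$\sum_{j=1}^{J} \Phi^j(\theta^j) = \sum_{j=1}^{J} q^j\phi^j(\theta^j)$. We define the function $\tau : \mathbb{R}^J \to \mathbb{R}$ by

$$\tau(x) = \sum_{j=1}^{J} \Phi^j(x^j) = \sum_{j=1}^{J} q^j\phi^j(x^j).$$

Then the total cost or gain induced by (trading) a portfolio $\theta \in \mathbb{R}^J$ is $\tau(\theta)$. As
we know, the function $\tau$ is sublinear, and hence convex.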
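
The cost function can be sketched in a few lines to make the sublinearity concrete. This is a minimal illustration, assuming the piecewise-linear cost described above with a purchase rate b and a selling rate s per security; the function names and the sample numbers are hypothetical.

```python
def phi(z, b, s):
    """Per-unit-price cost of a position z: buying costs (1 + b) z, selling yields (1 - s) z."""
    return (1.0 + b) * z if z >= 0 else (1.0 - s) * z


def total_cost(theta, q, b, s):
    """Total cost or gain of portfolio theta: sum over securities of q_j * phi_j(theta_j)."""
    return sum(qj * phi(tj, bj, sj) for tj, qj, bj, sj in zip(theta, q, b, s))


# Hypothetical two-security market.
q = [10.0, 20.0]
b = [0.01, 0.02]   # purchase cost rates
s = [0.01, 0.03]   # selling cost rates

buy_sell = [1.0, -1.0]
sell_buy = [-1.0, 1.0]
round_trip = [x + y for x, y in zip(buy_sell, sell_buy)]  # the zero portfolio
```

Trading `buy_sell` and then `sell_buy` costs strictly more than holding their sum (the zero portfolio), which is the subadditivity inequality in a case where it is strict. It also shows why the set of pairs (cost, payoff) is not a linear subspace once friction is present.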

Definition 3 The frictional market $(q, V, b, s)$ is weakly arbitrage-free if any
portfolio $\theta \in \mathbb{R}^J$ of securities has a positive total cost or gain $\tau(\theta) \ge 0$ whenever
it has a positive payoff $V\theta \in T_+$.

We note that Definitions 2 and 2' are equivalent. However, in the presence
of transaction costs, the corresponding Definitions 4 and 4' (as follows) are not.
We establish Theorem 4 for Definition 4' of the strictly arbitrage-free frictional
market $(q, V, b, s)$.

Definition 4 The frictional market $(q, V, b, s)$ is strictly arbitrage-free if (1)
any portfolio $\theta \in \mathbb{R}^J$ of securities has a strictly positive total cost or gain
$\tau(\theta) > 0$ whenever it has a positive non-zero payoff $V\theta \in T_+ \setminus \{0\}$; and (2) any portfolio
$\theta \in \mathbb{R}^J$ of securities has a zero total cost or gain $\tau(\theta) = 0$ whenever it has a
zero payoff $V\theta = 0$.

Definition 4' The frictional market $(q, V, b, s)$ is strictly arbitrage-free if
(1) any portfolio $\theta \in \mathbb{R}^J$ of securities has a strictly positive total cost or gain
$\tau(\theta) > 0$ whenever it has a positive non-zero payoff $V\theta \in T_+ \setminus \{0\}$; and (2) any
portfolio $\theta \in \mathbb{R}^J$ of securities has a positive total cost or gain $\tau(\theta) \ge 0$ whenever
it has a positive payoff $V\theta \in T_+$.

Definition 4 obviously implies Definition 4'. Definition 4' does not imply
Definition 4 because of the presence of friction. In the frictionless model, we
define the marketed subspace

$$M = \{ (-q^\top\theta, V\theta) \in E \mid \theta \in \mathbb{R}^J \}$$

of the space $E$ to prove the first fundamental theorems of asset pricing. In
the frictional model, we cannot consider the corresponding marketed “subspace”
$\{ (-\tau(\theta), V\theta) \in E \mid \theta \in \mathbb{R}^J \}$. In fact, this marketed “subspace” is not a subspace
of the space $E$. Instead, we define the subset $M'$ in the space $E$ as follows:

$$M' = \{ (r, t) \in E \mid r \le -\tau(\theta) \text{ and } t = V\theta \text{ for } \theta \in \mathbb{R}^J \}$$




Arbitrage-Free Asset Pricing in General State Space 557 



Lemma 2 M' is a closed and convex cone in the space E. 

For simplicity, we use the following notation in the subsequent sections:

$$b = (b^1, \dots, b^J)^\top, \quad \mathbf{1} = (1, \dots, 1)^\top \quad \text{and} \quad s = (s^1, \dots, s^J)^\top .$$

We define the box product of two vectors $y_1 \in \mathbb{R}^J$ and $y_2 \in \mathbb{R}^J$ componentwise by

$$y_1 \,\square\, y_2 = (y_1^1 y_2^1, \dots, y_1^J y_2^J).$$


3.1 Weakly Arbitrage-free Security Valuation Theorem 

We assume that T is a locally convex topological space. 

Proposition 3 The frictional market $(q, V, b, s)$ is weakly arbitrage-free if and
only if $M' \cap E_+ = \{0\} \times \langle V \rangle_+$.

Theorem 3 The frictional market $(q, V, b, s)$ is weakly arbitrage-free if and only
if there exists a positive functional $v \in T^*_+$ satisfying

$$q \,\square\, (\mathbf{1} - s) \le vV \le q \,\square\, (\mathbf{1} + b).$$


3.2 Strictly Arbitrage-free Security Valuation Theorem 

In this section, we assume that $T$ is a separable Banach space. We prove the
following Proposition 4 and Theorem 4 for Definition 4' of the strictly
arbitrage-free frictional market $(q, V, b, s)$.

Proposition 4 The frictional market $(q, V, b, s)$ is strictly arbitrage-free if and
only if $M'$ and $E_+$ intersect precisely at $(0, 0)$, that is, $M' \cap E_+ = \{(0, 0)\}$.

Theorem 4 The frictional market $(q, V, b, s)$ is strictly arbitrage-free if and
only if there exists a strictly positive functional $v \in T^*_{++}$ satisfying

$$q \,\square\, (\mathbf{1} - s) \le vV \le q \,\square\, (\mathbf{1} + b).$$

Abstract. Clustering methods partition a set of objects into clusters
such that objects in the same cluster are more similar to each other
than objects in different clusters according to some defined criteria. In
this paper, we present an algorithm, called tabu search fuzzy k-modes,
to extend the fuzzy k-means paradigm to categorical domains. Using
the tabu search based technique, our algorithm can explore the solution
space beyond local optimality in order to aim at finding a global optimal
solution of the fuzzy clustering problem. It is found that our algorithm
performs better, in terms of accuracy, than the fuzzy k-modes algorithm.



1 Introduction 

Partitioning a large set of objects into homogeneous clusters is a fundamental 
operation in data science. A set of objects described by a number of attributes is 
to be classified into several clusters such that each object is allowed to belong to 
more than one cluster with different degrees of association. This fuzzy clustering 
problem can be represented as a mathematical optimization problem:

$$\min_{W, Z} F(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}^{\alpha}\, d(z_l, x_i) \qquad (1)$$

subject to

$$0 \le w_{li} \le 1, \quad \sum_{l=1}^{k} w_{li} = 1, \quad 0 < \sum_{i=1}^{n} w_{li} < n, \quad 1 \le l \le k, \ 1 \le i \le n, \qquad (2)$$

where $n$ is the number of objects, $m$ is the number of attributes of each object,
$k \, (\le n)$ is a known number of clusters, $\alpha \, (> 1)$ is the weighting exponent,
$X = \{x_1, x_2, \dots, x_n\}$ is a set of $n$ objects
with $m$ attributes, $Z = [z_1, z_2, \dots, z_k]$ is an $m$-by-$k$ matrix containing $k$ cluster
centers, $W = [w_{li}]$ is a $k$-by-$n$ matrix and $d(z_l, x_i) \, (\ge 0)$ is some dissimilarity
measure between the cluster center $z_l$ and the object $x_i$.

The above optimization problem was first formulated by Dunn [2]. A widely
known approach to this problem is the fuzzy k-means algorithm, which was
proposed by Ruspini [3] and Bezdek [4]. The fuzzy k-means algorithm is efficient in
clustering large data sets. The fuzzy k-means algorithm is initiated by selecting
a value for W; then the algorithm iterates between computing cluster centers,
Z, given W and computing W, given Z. The algorithm terminates when two
successive values of W or Z are equal. It has been shown that the fuzzy k-means
algorithm converges [4,5,10,11]. However, the algorithm may stop at a local
minimum of the optimization problem. This is because the function F(W, Z)
is non-convex in general.

To obtain the global optimal solution of combinatorial optimization
problems, tabu search based techniques, which were introduced by Glover [6], are
applied. Tabu search based techniques are concerned with imposing restrictions
to guide a search process to negotiate otherwise difficult regions. The search
procedures do not immediately terminate at a local optimal solution; instead,
the procedures attempt to search beyond local optimality in order to
get the global optimal solution. Al-Sultan and Fedjki [1] have proposed a tabu
search based algorithm for the fuzzy clustering problem. Their proposed
tabu search based algorithm has been found to outperform the fuzzy k-means
algorithm considerably in their tests.

However, the fuzzy k-means algorithm only works on numeric data, which
limits its use in clustering where large categorical data sets are frequently
encountered. To deal with categorical data sets, Huang [8], and Huang and Ng
[9] suggested the fuzzy k-modes algorithm. This algorithm extends the k-means
algorithm by applying a simple matching dissimilarity measure for categorical
objects and using modes instead of means for clusters. The main aim of this
paper is to develop a tabu search based fuzzy k-modes algorithm to obtain a global
solution of the fuzzy categorical data clustering problem.

The outline of the paper is as follows. In Section 2, the fuzzy k-modes
algorithm is briefly reviewed. In Section 3, tabu search based techniques are
introduced and the new clustering algorithm is proposed. In Section 4, the numerical
results are presented to illustrate the effectiveness of our new approach.



2 Fuzzy k-Modes Algorithm



The fuzzy k-modes algorithm is modified from the k-means algorithm by using
a simple matching dissimilarity measure for categorical data, and replacing the
means of clusters with the modes. These modifications remove the numeric-only
limitation of the k-means algorithm while maintaining its efficiency in clustering
categorical data sets. The simple matching dissimilarity measure between $z_l$ and
$x_i$, for $l = 1, 2, \dots, k$ and $i = 1, 2, \dots, n$, is defined as:

$$d_c(z_l, x_i) = \sum_{j=1}^{m} \delta(z_{lj}, x_{ij}) \qquad (3)$$

where $z_l = [z_{l1}, \dots, z_{lm}]^\top$, $x_i = [x_{i1}, \dots, x_{im}]^\top$ and

$$\delta(z_{lj}, x_{ij}) = \begin{cases} 0, & \text{if } z_{lj} = x_{ij} \\ 1, & \text{if } z_{lj} \ne x_{ij} \end{cases} \qquad (4)$$



Minimization of F in (1) with the simple matching dissimilarities and the
constraints in (2) forms a class of constrained nonlinear optimization problems whose
solution is unknown. The usual method towards optimization of F in (1) is to
use partial optimization for Z and W. In this method we first fix Z and find
necessary conditions on W to minimize F. Then we fix W and minimize F with
respect to Z. This process is formalized in the fuzzy k-modes algorithm.

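The matrices W and Z are formulated in the following way. Let Z be fixed,
i.e., $z_l$ ($l = 1, 2, \dots, k$) are given; then we can find W by:

$$w_{li} = \begin{cases} 1, & \text{if } x_i = z_l \\ 0, & \text{if } x_i = z_h \text{ but } h \ne l \\ 1 \Big/ \sum_{h=1}^{k} \left[ \dfrac{d_c(z_l, x_i)}{d_c(z_h, x_i)} \right]^{1/(\alpha - 1)}, & \text{if } x_i \ne z_l \text{ and } x_i \ne z_h, \ 1 \le h \le k \end{cases} \qquad (5)$$

for $1 \le l \le k$, $1 \le i \le n$. Let W be fixed; then we can find Z by the k-modes
update method. Let $X$ be a set of categorical objects described by $m$ categorical
attributes $A_1, A_2, \dots, A_m$. Each attribute $A_j$ has $n_j$ categories: $a_j^{(1)}, a_j^{(2)}, \dots, a_j^{(n_j)}$
for $1 \le j \le m$. Let the $l$-th cluster center be $z_l = [z_{l1}, z_{l2}, \dots, z_{lm}]^\top$. Then
F(W, Z) is minimized if and only if

$$z_{lj} = a_j^{(r)} \quad \text{where} \quad \sum_{i,\, x_{ij} = a_j^{(r)}} w_{li}^{\alpha} \ \ge \sum_{i,\, x_{ij} = a_j^{(t)}} w_{li}^{\alpha}, \quad 1 \le t \le n_j. \qquad (6)$$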


However, the fuzzy k-modes algorithm may only stop at a local optimal
solution of the clustering problem. This means that the solution obtained can
still be further improved. Therefore, tabu-search techniques are incorporated in
order to find the global optimal solution of the optimization problem (1).



3 Tabu Search Based Categorical Data Clustering 

The tabu search method is based on procedures designed to cross boundaries of
feasibility or local optimality, which were usually treated as barriers, and sys- 
tematically to impose and release constraints to permit exploration of otherwise 
forbidden regions. Tabu search is a meta-heuristic that guides a local heuristic 
search procedure to explore the solution space beyond local optimality. A fun- 
damental element underlying tabu search is the use of flexible memory. A chief 
mechanism for exploiting memory in tabu search is to classify a subset of the 
moves in a neighborhood as forbidden or tabu. 

Our new algorithm, given in Table 1, combines the fuzzy k-modes algorithm with
tabu search techniques in order to find the global optimal solution of the
clustering problem for categorical data. In our algorithm, equation (5) is used to
update the fuzzy partition matrix W. However, we do not use equation (6) to update
the cluster centers Z. Instead, Z is generated by the method described below and is
mapped to a value of the objective function. This technique has been used by
Al-Sultan and Fedjki [1].

Let $Z^t$, $Z^u$, $Z^b$ denote the trial, current and best cluster centers, and $F^t$, $F^u$, $F^b$
denote the corresponding trial, current and best objective function values,
respectively. A number of trial cluster centers $Z^t$ are generated through moves
from the current cluster centers $Z^u$. As the algorithm proceeds, the best cluster
centers found so far are saved in $Z^b$. The corresponding objective function values
$F^t$, $F^u$ and $F^b$ are updated accordingly. One of the most distinctive features of
tabu search is the generation of neighborhoods. Since numerical data sets have
a natural ordering, the neighborhood of the center $z^u$ is defined as follows:

$$N(z^u) = \{\, y = [y_1, y_2, \ldots, y_m]^T \mid y_i = z_i^u + \phi d,\ i = 1, 2, \ldots, m,\ d = 0, -1 \text{ or } +1 \,\}. \tag{7}$$

We note that when $z^u$ is close to the solution, a small step size $\phi$ can be used.
The neighbors of $z^u$ can be generated by picking randomly from $N(z^u)$.

There are two kinds of categorical attributes, namely, ordinal and nominal.
Ordinal attributes have ordered levels, such as size and education level.
Their neighborhoods can be defined as in (7) for numerical data sets.
However, this approach cannot be applied to categorical data sets with nominal
attributes, since they do not have a natural ordering. In this paper, we propose to
use the "distance" concept to make moves from the cluster center for categorical
data sets. The neighborhood of $z^u$ is defined as follows:

$$N(z^u) = \{\, y = [y_1, y_2, \ldots, y_m]^T \mid d_c(y, z^u) \le d \,\}, \tag{8}$$

for some positive integer d. In our algorithm, we generate a set of neighbors
which are of a certain distance d from the center, i.e., neighbors which have d 
attributes different from the center. 
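A neighbor at distance d from a center can be drawn by changing exactly d attributes to a different category. The sketch below is one possible way to do this; the `domains` argument (the list of categories of each attribute), the function name and the sampling strategy are assumptions made for illustration.

```python
import random


def random_neighbor(center, domains, d, rng=random):
    """Return a copy of center in which exactly d attributes take a different category."""
    # Only attributes with at least one alternative category can be changed
    changeable = [j for j, dom in enumerate(domains) if any(c != center[j] for c in dom)]
    if d > len(changeable):
        raise ValueError("d exceeds the number of attributes that can change")
    neighbor = list(center)
    for j in rng.sample(changeable, d):
        alternatives = [c for c in domains[j] if c != center[j]]
        neighbor[j] = rng.choice(alternatives)
    return neighbor


center = ["a", "x", "p"]
domains = [["a", "b"], ["x", "y", "z"], ["p", "q"]]  # hypothetical attribute domains
print(random_neighbor(center, domains, d=2, rng=random.Random(0)))
```

Calling the function NH times yields the set of trial centers used in Step 2 of Table 1.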

4 Experimental Results 

The tabu search-based categorical clustering algorithm is coded in the C++
programming language. The data set is the soybean disease data set [9]. We choose
this data set to test these algorithms because all attributes of the data can be 
treated as categorical. The soybean data set has 47 records, each being described 
by 35 attributes. Each record is labelled as one of the 4 diseases: Diaporthe Stem 
Canker, Charcoal Rot, Rhizoctonia Root Rot, and Phytophthora Rot. Except 
for Phytophthora Rot which has 17 records, all other diseases have 10 records 
each. Of the 35 attributes we only selected 21 because the other 14 have only 
one category. 

We use the fuzzy k-modes and tabu search based k-modes clustering algorithms
to cluster this data set into 4 clusters. The initial modes are k distinct records
selected at random from the data set. For the fuzzy k-modes algorithm
we specify $\alpha = 1.1$. We obtain the cluster memberships from W as follows. The
record $x_i$ is assigned to the $l$-th cluster if $w_{li} = \max_{1 \le h \le k} \{w_{hi}\}$. If the
maximum is not unique, then $x_i$ is assigned to the first cluster achieving the
maximum. A clustering result is measured by the clustering accuracy r, defined as

$$r = \frac{1}{n} \sum_{l=1}^{k} a_l,$$

where $a_l$ is the number of instances occurring in both cluster l
and its corresponding class and n is the number of instances in the data set.
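The clustering accuracy r can be computed as sketched below. The text does not say how the class corresponding to a cluster is chosen, so this sketch assumes it is the majority class of the cluster; under that assumption two clusters may map to the same class. The function name and sample labels are illustrative.

```python
from collections import Counter


def clustering_accuracy(cluster_ids, class_labels):
    """r = (sum over clusters of a_l) / n, with a_l the size of the majority class in cluster l."""
    n = len(class_labels)
    per_cluster = {}
    for c, y in zip(cluster_ids, class_labels):
        per_cluster.setdefault(c, Counter())[y] += 1
    # a_l: instances of cluster l that belong to its (assumed) corresponding class
    return sum(max(counts.values()) for counts in per_cluster.values()) / n


cluster_ids = [0, 0, 0, 1, 1]
class_labels = ["A", "A", "B", "B", "B"]
print(clustering_accuracy(cluster_ids, class_labels))
```

A run with r = 1 is one in which every cluster contains records of a single class.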

Each algorithm is run 100 times. We select the values $\gamma = 0.75$, $P = 0.97$, $d = 3$
and IMAX = 100 for the tabu search based k-modes clustering algorithm. Moreover,
the tabu list size is 100 and the number of trial solutions is 50.
It is found that the classification accuracy of the algorithm
is very high. The average accuracy is about 99%, and the number of runs in which
all records are correctly clustered into the 4 given clusters is 67. In Table 2,
we compare the average clustering accuracy and the number of runs with all
records correctly classified for the fuzzy k-modes and tabu search based k-modes
algorithms.

Fig. 1. Average clustering accuracy and objective function values.

Next we test different sets of parameters of the tabu search based k-modes
clustering algorithm. For each set of parameters, the algorithm is run 100 times.
Figure 1 shows the relationship between the average clustering accuracy and the
average objective function values. We see that the average objective function
values of results with high classification accuracy are lower than those of results
with low classification accuracy. This relationship indicates that we can use the
objective function values to choose a good clustering result when the original
classification of the data is unknown.

Finally, we report that the computational time taken at each step of the tabu search
based k-modes clustering algorithm increases linearly as any one of the following
parameters increases: the number of objects, the number of attributes, the size of
the tabu list, or the number of trial solutions. Thus the tabu search based k-modes
algorithm is efficient and effective for clustering categorical data sets.
Tabu Search Based Categorical Clustering Algorithm:

Step 1: Initialization

Let $Z^u$ be arbitrary centers and $F^u$ the corresponding objective function value. Let $Z^b = Z^u$
and $F^b = F^u$. Select values for NTLM (tabu list size), P (probability threshold), NH
(number of trial solutions), IMAX (the maximum number of iterations for each center),
and $\gamma$ (the iteration reducer). Let h = 1, NTL = 0 and r = 1. Go to Step 2.

Step 2:

Using $Z^u$, fix all centers and move center $z_r^u$ by generating NH neighbors $z_1^t, z_2^t, \ldots, z_{NH}^t$,
and evaluate their corresponding objective function values $F_1^t, F_2^t, \ldots, F_{NH}^t$. Go to Step 3.

Step 3:

(a) Sort $F_i^t$, $i = 1, \ldots, NH$, in nondecreasing order and denote them as $F_{[1]}^t, \ldots, F_{[NH]}^t$.
Clearly $F_{[1]}^t \le \ldots \le F_{[NH]}^t$. Let e = 1. If $F_{[1]}^t > F^b$, then replace h by h + 1. Go to Step 3(b).

(b) If $z_{[e]}^t$ is not tabu, or if it is tabu but $F_{[e]}^t < F^b$, then let $z_r^u = z_{[e]}^t$ and $F^u = F_{[e]}^t$ and go to
Step 4. Otherwise generate $u \sim U(0, 1)$, where $U(0, 1)$ is a uniform density function between
0 and 1. If $F^b < F_{[e]}^t < F^u$ and $u > P$, then let $z_r^u = z_{[e]}^t$ and $F^u = F_{[e]}^t$ and go to Step 4;
otherwise, go to Step 3(c).

(c) Check the next neighbor by letting e = e + 1. If $e \le NH$, go to Step 3(b). Otherwise
go to Step 3(d).

(d) If h > IMAX, then go to Step 5. Otherwise select a new set of neighbors by going to Step 2.

Step 4:

Insert $z_r^u$ at the bottom of the tabu list. If NTL = NTLM, then delete the top of the tabu list;
otherwise let NTL = NTL + 1. If $F^b > F^u$, then let $F^b = F^u$ and $Z^b = Z^u$. Go to Step 3(d).

Step 5:

If r < k, then let r = r + 1, reset h = 1 and go to Step 2. Otherwise set IMAX =
$\gamma$(IMAX). If IMAX > 1, then let r = 1, reset h = 1 and go to Step 2; otherwise stop.

($Z^b$ represents the best centers and $F^b$ is the corresponding best objective function value.)



Table 1. Tabu search based categorical clustering algorithm. 
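The acceptance test of Step 3(b) and the tabu list handling of Step 4 can be illustrated with the short sketch below. It is not the authors' C++ code: the function name, the use of tuples to store centers and the bounded `collections.deque` for the tabu list are assumptions made for illustration.

```python
from collections import deque


def accept_trial(z_trial, f_trial, f_current, f_best, tabu_list, p, u):
    """Step 3(b): decide whether the trial center replaces the current center."""
    is_tabu = tuple(z_trial) in tabu_list
    if not is_tabu or f_trial < f_best:
        # Non-tabu move, or tabu move that improves on the best value found so far
        return True
    # Tabu move that does not beat the best value: accept only with a random draw above P
    return f_best < f_trial < f_current and u > p


# Step 4: a deque with maxlen=NTLM drops its oldest entry when a new one is appended
tabu_list = deque(maxlen=2)
for center in [("a", "b"), ("a", "c"), ("b", "c")]:
    tabu_list.append(center)
print(list(tabu_list))
```

With P = 0.97, a tabu trial that improves on the current value but not on the best one is accepted only when the uniform draw u exceeds 0.97.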





                              average accuracy    number of runs that r = 1

Fuzzy k-modes                 0.790               20

Tabu search based k-modes     0.991               67



Table 2. Clustering accuracy. 
Abstract In the context of unsupervised clustering, many different
algorithms have been proposed. Most of them consist of optimizing an
objective function using a search strategy. We present here a new
methodology for studying and comparing the performance of the objective
functions and search strategies employed.



1 Introduction 

Unsupervised clustering is an important data mining tool whose goal is to
summarize a huge amount of data by a small number of homogeneous and distinct
classes. Those classes form a partition of the set of objects and best summarize
the similarity between them: each cluster contains objects that are
pairwise most similar, and the most dissimilar objects belong to different
classes. Those methods are useful for at least two reasons. First, they provide a
summary of the dataset that is of a size a human can grasp; second, they can be used
as a pre-processing step to reduce the cost of subsequent treatments. In
contrast to supervised classification methods, the construction of the partition
is not guided by a known class variable. This point constitutes an important
difficulty, since no a priori or external reference is available. The goal partition
is not necessarily unique (this depends on the chosen measure), and there is no
consensual external criterion for evaluating the quality of a solution.

Since the beginning of the sixties, many different algorithms have been
proposed, and they lead to distinct results. Three main characteristics distinguish
these algorithms. The first is the way they define, in concrete terms, the similarity
between pairs of objects. Most algorithms use a distance on the descriptive vectors
[1] (such as the Mahalanobis, the Gaussian or the Euclidean one) based on the metric
and separation properties of the underlying topological space $\mathbb{R}^n$. Other methods
compare the probability vectors associated with each value of the nominal variables
in each class [2]. This makes it possible, for instance, to search for classes in
which objects share the same value on most of the variables. The second characteristic
is the objective function used to evaluate the relevance of a partition.
This function is often a compromise between the intra-cluster similarity and the
difference between clusters. Because of the combinatorial number of different
partitions and, above all, the absence of structure among those partitions, no
algorithm performs an exhaustive search, and it is thus commonly assumed that
local search is preferable. This optimization methodology constitutes the third
characteristic of clustering algorithms. These algorithms are generally presented
as universal methods, which is why there is little information about the
circumstances in which they can actually be used. Nevertheless, several comparative
studies provide results on the behavior of various algorithms on the same data
set. Those studies are very useful for evaluating the quality of a method,
but they cannot explain which part of the algorithm causes the
difference in the results. Are we sure that the objective function distinguishes
the partitions well? Which of several search strategies is the best one? These
questions are not trivial, and their answers can lead to a better understanding of
the algorithms.

In this paper we propose a new methodology for studying the behavior of
the objective functions of unsupervised clustering methods. This methodology is
based on the construction of an order on all the partitions, independently of the
objective functions studied, and is presented in detail in the following section.
Then, we present several objective functions and compare them through our
protocol. The results show that this protocol discriminates well between them.

2 Evaluating partitions 

One way to study the behavior of an objective function on a set of partitions is
to determine a total order on this set. The main advantages of such an approach
are, first, that it is independent of the function and, second, that it contains
some a priori knowledge of the subjective notion of a “good” partition. However,
there is no natural total order. That is why we propose the following methodology.
First, we design a data set such that the goal partition is obvious, and fix this
partition as a reference denoted by $P_0$. Then we use a distance d and compare $P_1$ and
$P_2$ using their respective distances to $P_0$. This distance has to take two
characteristics into account. In order to compare the quality of the two partitions,
the distance needs to consider the similarity of the clusters from the point of view
of the variable description. However, this is not sufficiently discriminating,
especially regarding the objects in the clusters, which is why it also has to
consider the similarity from the point of view of the objects.



2.1 A distance taken on the variables 

We compute the $L_1$ distance between the probability vectors associated with each
modality of all the variables in each cluster. Let X be a finite set of objects
described by p variables with m modalities each. $\mathcal{P}(X)$ is the set of all
subsets of X. With each set belonging to $\mathcal{P}(X)$ a probability vector of length
$(p \times m)$ is associated. The following normalized distance compares two such
vectors: $\forall\, C_k, C_{k'} \in \mathcal{P}(X)$,

$$\mu_v(C_k, C_{k'}) = \frac{1}{2p} \sum_{i=1}^{p} \sum_{j=1}^{m} \left| P(A_i = V_{ij} \mid C_k) - P(A_i = V_{ij} \mid C_{k'}) \right|$$

with $P(A_i = V_{ij} \mid C_k)$ the conditional probability that objects in cluster $C_k$ take
the modality $V_{ij}$ on the variable $A_i$.
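The distance on the variables can be sketched as follows. The normalization constant did not survive in the text, so this sketch assumes a division by 2p, which maps the L1 distance between two sets of p probability distributions into [0, 1]. The function names and the convention that an empty cluster has a zero vector are also assumptions.

```python
def modality_probabilities(cluster, modalities):
    """Flattened vector of P(A_i = V_ij | cluster) over all variables i and modalities j."""
    n = len(cluster)
    probs = []
    for i, values in enumerate(modalities):
        for v in values:
            count = sum(1 for obj in cluster if obj[i] == v)
            probs.append(count / n if n else 0.0)  # assumption: empty cluster -> zeros
    return probs


def mu_v(ck, ck2, modalities):
    """L1 distance between the two probability vectors, divided by 2p (assumed normalization)."""
    p = len(modalities)
    a = modality_probabilities(ck, modalities)
    b = modality_probabilities(ck2, modalities)
    return sum(abs(x - y) for x, y in zip(a, b)) / (2 * p)


modalities = [[0, 1], [0, 1]]   # two Boolean variables
cluster_a = [(0, 0), (0, 0)]
cluster_b = [(1, 1)]
print(mu_v(cluster_a, cluster_b, modalities))
```

Two clusters whose objects never share a modality are at distance 1 under this normalization, and two clusters with identical modality frequencies are at distance 0.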



2.2 A distance taken on the objects 

To compare two clusters from the point of view of the objects, we use
the Marczewski and Steinhaus distance [3], proposed for the comparison of two
sets. This distance is based on the symmetric difference of the two sets (i.e. the
number of objects which belong to only one of the two sets), normalized by
the cardinality of the union of the sets. In the following, |.| denotes the cardinality
of a set. The distance between two elements of $\mathcal{P}(X)$ is:



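$$\mu_o(C_k, C_{k'}) = \begin{cases} \dfrac{|C_k \,\Delta\, C_{k'}|}{|C_k \cup C_{k'}|} & \text{if } |C_k \cup C_{k'}| > 0 \\ 0 & \text{otherwise} \end{cases}$$

with $\Delta$ the symmetric difference. As $C_k \,\Delta\, C_{k'} \subseteq C_k \cup C_{k'}$, this distance is
normalized, taking the value 1 when $C_k \cap C_{k'} = \emptyset$ and the value 0 when $C_k = C_{k'}$.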
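The Marczewski and Steinhaus distance maps directly onto set operations. The following minimal sketch uses Python sets; the function name is an assumption.

```python
def marczewski_steinhaus(a, b):
    """|A symmetric-difference B| / |A union B|, or 0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a ^ b) / len(union)


print(marczewski_steinhaus({1, 2, 3}, {2, 3, 4}))
```

For the two sets above, two objects belong to only one set and the union holds four objects, so the distance is 0.5.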



2.3 A Hausdorff like distance between two partitions 

We can compare two partitions (using a measure for evaluating the proximity
between clusters) with a Hausdorff-like distance [4]. This distance makes it possible
to compare any pair of partitions of the same set, even if they have different numbers
of clusters. Let $P_1$ and $P_2$ be two partitions such that $P_1 = \{C_{11}, C_{12}, \ldots, C_{1 l_1}\}$
and $P_2 = \{C_{21}, C_{22}, \ldots, C_{2 l_2}\}$, indexed by $I_1 = \{1, \ldots, l_1\}$ and
$I_2 = \{1, \ldots, l_2\}$. Given a measure $\mu$ between two sets, we construct
a distance between $P_1$ and $P_2$ as follows:

$$D_\mu(P_1, P_2) = \frac{1}{2} \left[ \max_{i \in I_1} \min_{j \in I_2} \mu(C_{1i}, C_{2j}) + \max_{j \in I_2} \min_{i \in I_1} \mu(C_{1i}, C_{2j}) \right]$$

This distance is based on the worst-case principle: for each
cluster of the first partition, we search for the closest cluster of the second one,
and we retain only the worst of those pairs. We then symmetrize the
result.
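The worst-case construction can be sketched as below for any measure between two clusters. The example plugs in the Marczewski and Steinhaus distance on the objects; the function names and the toy partitions are assumptions.

```python
def marczewski_steinhaus(a, b):
    """Normalized symmetric difference between two sets."""
    union = a | b
    return len(a ^ b) / len(union) if union else 0.0


def hausdorff_like(p1, p2, mu):
    """Half the sum of the two directed worst-case distances between partitions p1 and p2."""
    # For each cluster of p1, distance to its closest cluster in p2; keep the worst case
    forward = max(min(mu(c1, c2) for c2 in p2) for c1 in p1)
    # Same computation from p2 towards p1, which symmetrizes the result
    backward = max(min(mu(c1, c2) for c1 in p1) for c2 in p2)
    return 0.5 * (forward + backward)


p1 = [{1, 2}, {3, 4}]
p2 = [{1, 2}, {3}, {4}]
print(hausdorff_like(p1, p2, marczewski_steinhaus))
```

Only the worst pair in each direction contributes, so changes to clusters that are already well matched leave the value unchanged.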

The two distances obtained by comparing clusters from the point
of view of the variables and of the objects can be combined through the Euclidean
distance, thanks to their common normalization:

$$D_{\min\text{-}\max}(P_1, P_2) = \sqrt{D_{\mu_v}^2(P_1, P_2) + D_{\mu_o}^2(P_1, P_2)}$$

Let $P_1$ and $P_2$ be two partitions. We define the $\le$ relation as

$$P_1 \le P_2 \iff D_{\min\text{-}\max}(P_0, P_1) \le D_{\min\text{-}\max}(P_0, P_2)$$

Obviously, two different partitions can be at the same distance from $P_0$ and
thus be tied in the total order.
2.4 An improved measure



The principle of the Hausdorff-like distance is very interesting, but we can
expect its behavior to be insufficiently discriminating on the set of all
partitions. Indeed, this measure takes into account only the worst associated pair
of clusters between the two partitions. In order to overcome this drawback,
we propose another distance between two partitions. This measure consists in
searching, for each cluster in one partition, the closest cluster of the other
partition that is not already associated with a cluster. Moreover, we associate a
cost with those associations through $\mu_v$ or $\mu_o$. We then simply sum the values
over all the associations. However, as we also want the maximum number of
associations between clusters, a non-association must be penalized, since it
corresponds to a comparison of subsets of the partitions and not of the whole
partitions. All of this can be expressed simply in graph-theoretic terms.

A graph G is a set of vertices V and a set of edges $E \subseteq V \times V$. The elements
of V are the clusters of the partitions to be compared. Let us remark that
$V = V_1 \cup V_2$, with $V_1$ corresponding to $P_1$ and $V_2$ corresponding to $P_2$, so we
restrict the edges E to be of the form $(v_1, v_2)$ with $v_1 \in V_1$ and $v_2 \in V_2$. This
corresponds to a bipartite graph. This graph is complete and all the edges are
weighted as previously mentioned. The problem to solve is then to find the
matching of maximum cardinality and minimum weight.



$$C(P_1, P_2) = \frac{1}{\min(|P_1|, |P_2|)} \sum_{e \in M} \mu(e), \qquad M \in \arg\max_{N \in \mathcal{C}_\mu(P_1, P_2)} |N|$$

with $\mathcal{C}_\mu(P_1, P_2)$ the set of all matchings between $P_1$ and $P_2$, M taken of minimum
total weight among the matchings of maximum cardinality, and $\mu(e)$ the weight of
edge e. This distance
has many advantages. It considers partitions as a whole, since each association
is made relative to the others. It penalizes bad associations, but less heavily
than the previous distance based on the worst case. Thus, we expect better
sensitivity to partition variations. Moreover, this approach is efficient since it
has a quadratic complexity and thus is tractable even in the case of large sets
[5].
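For small partitions, the matching-based distance can be illustrated by brute force, as in the sketch below. This is not the efficient algorithm of [5]: it enumerates every maximum-cardinality matching of the complete bipartite graph. It also assumes a symmetric measure, a normalization by the size of the smaller partition, and no extra penalty for clusters left unmatched; the function names are hypothetical.

```python
from itertools import permutations


def marczewski_steinhaus(a, b):
    """Normalized symmetric difference between two sets."""
    union = a | b
    return len(a ^ b) / len(union) if union else 0.0


def matching_distance(p1, p2, mu):
    """Minimum total weight over maximum-cardinality matchings, divided by min(|P1|, |P2|)."""
    if not p1 or not p2:
        raise ValueError("both partitions must contain at least one cluster")
    small, large = (p1, p2) if len(p1) <= len(p2) else (p2, p1)
    # Each injective assignment of the smaller partition into the larger one
    # is a matching of maximum cardinality in the complete bipartite graph
    best = min(
        sum(mu(c, large[j]) for c, j in zip(small, chosen))
        for chosen in permutations(range(len(large)), len(small))
    )
    return best / len(small)


p1 = [{1, 2}, {3, 4}]
p2 = [{1, 2}, {3}, {4}]
print(matching_distance(p1, p2, marczewski_steinhaus))
```

Unlike the worst-case distance, every matched pair contributes to the sum, so improving any single association lowers the value.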



2.5 The methodology 

To compare several objective functions, we first have to design a synthetic
problem in which the goal partition is known. Then we define a subset of partitions
on which the objective functions are studied. The distance between each
partition and the reference defines an order. Let us remark that such an ordering
also makes it possible to study the behavior of the optimization procedure through
its walk on the graph of the function which associates with each partition the value
of the objective function. This is currently under study.

3 Experimentation 

The protocol 

We compare three objective functions. The first one is used in the well-known
conceptual clustering algorithm COBWEB [6] and is called category utility. It is
a trade-off between the intra-class similarity and the inter-class dissimilarity of
objects, where objects are described by nominal variables. It is equal to the weighted
average of the Gini entropy of the variable distribution. The second function is
Quinlan’s gain ratio, generally used in supervised methods and, as suggested in
[2], adapted to unsupervised clustering. This function is defined as the
difference between the entropy of the variables conditioned on the partition and the
entropy of the variables, divided by the entropy of the class variable. The third
one is the Lopez de Mantaras normalized information gain, which is a slightly
different normalization of Quinlan’s gain ratio [7].

To evaluate the behavior of those functions, we construct synthetic datasets
made of k subsets of variables and k subsets of objects, resulting in a block
diagonal Boolean matrix. We design two such matrices: 8 objects x 8 variables,
and 60 objects x 15 variables to simulate realistic datasets. It is possible to
enumerate all the partitions of a set of 8 objects, but this becomes impracticable
for 60 objects. Consequently, on the first dataset we compute the exhaustive set
of partitions, but on the second one we draw 30 000 partitions at random. We
also introduce some noise by random permutations in the dataset matrix.
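A block diagonal Boolean matrix of this kind can be generated as sketched below. The value of k, the equal-sized blocks and the interpretation of noise as random swaps of pairs of cells are assumptions; the text does not specify them.

```python
import random


def block_diagonal(n_objects, n_variables, k):
    """Boolean matrix with k diagonal blocks of ones (rows are objects, columns are variables)."""
    matrix = [[0] * n_variables for _ in range(n_objects)]
    for i in range(n_objects):
        block = i * k // n_objects
        for j in range(n_variables):
            if j * k // n_variables == block:
                matrix[i][j] = 1
    return matrix


def add_noise(matrix, rate, rng):
    """Swap randomly chosen pairs of cells; rate is the fraction of cells used as swap count."""
    noisy = [row[:] for row in matrix]
    cells = [(i, j) for i in range(len(noisy)) for j in range(len(noisy[0]))]
    for _ in range(int(rate * len(cells))):
        (i1, j1), (i2, j2) = rng.sample(cells, 2)
        noisy[i1][j1], noisy[i2][j2] = noisy[i2][j2], noisy[i1][j1]
    return noisy


clean = block_diagonal(8, 8, k=4)          # k = 4 is a hypothetical choice
noisy = add_noise(clean, rate=0.05, rng=random.Random(0))
for row in clean:
    print(row)
```

Because the noise is produced by swaps, the noisy matrix keeps the same number of ones as the clean one.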

Some results 

In Figure 1 (left), we plot the distances (matching index and Hausdorff-like
distance) on the x axis and the values of Quinlan’s gain ratio on the y axis. As we
expected, the matching index is far more discriminating than the Hausdorff-like
distance. We observe similar results for all the other measures and noise levels.
The worst-case approach has the drawback of being invariant over a fairly large
set of partitions, and is therefore insufficiently discriminating. We also compare
the Quinlan gain ratio with the CU (category utility) measure (see Figure 1, right).
The variations of CU are so small that nearly all partitions appear similar
under this measure, except the extremal one. Following these preliminary results,
the Quinlan measure can be considered a better measure than CU; however, more
experiments are necessary to conclude. To simulate a real case, we introduce
some noise in the Boolean matrix (see Figure 2). The Quinlan measure appears to be
more noise resistant than CU. With a 5 percent noise level, it behaves as in
the ideal case. When the noise increases, some partitions take aberrant values (see
Figure 2, left). However, this measure remains regular, whereas CU becomes strongly
perturbed (see Figure 2, right).

Figure 1. Noise-free case (distances on the abscissa): (left) comparison of distances; (right) comparison of measures.

Figure 2. Influence of noise versus distance: (left) Quinlan gain ratio; (right) CU.



4 Conclusion 

We have presented a new methodology for ordering partitions in order to objectively
compare the behavior of different quality measures used in unsupervised learning.
Our methodology is independent of the measures studied, has polynomial
complexity, and also makes it possible to study the optimization procedure of various
unsupervised clustering methods. Some work has been done in this direction and will
be the subject of a forthcoming article.